PREFACE



This book contains the papers presented at the ninth
Very Large Scale Integrated Systems conference, VLSI'97, which is
organized biennially by IFIP Working Group 10.5. It took place at
Hotel Serra Azul in Gramado, Brazil, from 26 to 30 August 1997.
Previous conferences have taken place in Edinburgh, Trondheim,
Vancouver, Munich, Grenoble and Tokyo.

The papers in this book report on all aspects of importance to the
design of current and future integrated systems. The current trend
towards the realization of versatile Systems-on-a-Chip requires
attention to embedded hardware/software systems, dedicated ASIC
hardware, sensors and actuators, mixed analog/digital design, video
and image processing, low power battery operation and wireless
communication. The papers presented in this book have been
organized in two tracks: one deals with VLSI System
Design and Applications and the other with VLSI Design
Methods and CAD. The following topics are addressed:

VLSI System Design and Applications Track 

• VLSI for Video and Image Processing. 

• Microsystem and Mixed-mode design. 

• Communication And Memory System Design 

• Low-voltage & Low-power Analog Circuits.

• High Speed Circuit Techniques 

• Application Specific DSP Architectures. 

VLSI Design Methods and CAD Track 

• Specification and Simulation at System Level. 

• Synthesis and Technology Mapping. 

• CAD Techniques for Low-Power Design. 

• Physical Design Issues in Sub-micron Technologies. 

• Architectural Design and Synthesis. 

• Testing in Complex Mixed Analog and Digital Systems. 

We would like to thank IFIP, and more specifically IFIP
TC10 and IFIP WG 10.5, for the support of this event, the
researchers active in VLSI who contributed to the success of the
conference, and the reviewers who carefully selected and provided
feedback on the papers.



Luc Claesen 

Leuven (Belgium) June 1997 



Ricardo Reis 
Porto Alegre (Brazil) 
A Low-Power H.263 Video CoDec
Core Dedicated to Mobile Computing



Morgan Hirosuke MIKI, 

Gen FUJITA, 

Takeshi KOBAYASHI, 

Takao ONOYE, and 
Isao SHIRAKAWA 

Dept. Information Systems Engineering 
Osaka University 

2-1 Yamada-Oka, Suita, Osaka, 565 Japan 
Phone: +81(6)879-7807, FAX: +81(6)875-5902
e-mail: {miki, fujita, kobayash, onoe, sirakawa}@ise.eng.osaka-u.ac.jp



Abstract 

A number of novel VLSI architectures are devised for an H.263 video codec core
aimed at low bitrate visual communication. The practicability for mobile
computing has been explored extensively by attempting not only to minimize
the total chip area but also to reduce the power consumption, to such an extent
that the operation frequency can be lowered to 15MHz. The whole encoding and
decoding facilities have been integrated in a die area of 7.66 mm^2 by means of a
0.35 µm CMOS technology, with a dissipation of 146.60 mW from a single 3.3V
supply.



Keywords 

VLSI, video codec, low bitrate, H.263, low power 
1 INTRODUCTION

The H.324 (ITU-T, 1995) international standard specifies low bitrate
audiovisual communication over the PSTN (Public Switched Telephone Network).
The H.263 (ITU-T, 1996) standard is the video coding part of H.324, and it
compresses the moving picture components of audiovisual services at low bitrates.
By means of H.263, QCIF (176 x 144) pictures at 10 fps (frames/sec) can be
coded at the V.34 rate (28.8 Kbps); moreover, at such a low bitrate the H.263
coding efficiency is superior to those of H.261 (ITU-T, 1993) and MPEG1
(ISO/IEC, 1993). Thus various applications of the H.263 standard are expected
in mobile computing, wireless multimedia communication, etc. In particular,
portable multimedia facilities in the wireless environment offer enormous
potential for multimedia communications.

The encoding/decoding process of the H.263 standard may be implemented with
any of the multimedia-enhanced DSPs (Brinthaupt, 1996)(Golston, 1996)(Okamoto,
1996) that have been developed specifically for H.261 or MPEG1. However, for
mobile and portable use, there still remains much room for reducing both the
power consumption and the chip area of the codec core. Thus there arises the
issue of how to develop H.263-specific architectures for VLSI implementation,
especially for mobile computing.

The present paper describes a number of VLSI architectures devised for
implementing an H.263 video codec dedicated principally to mobile computing.
The main distinctive features of this codec consist in a large reduction not only in
the total chip area but also in the power consumption, to such an extent that the
operation frequency is lowered to 15MHz. All of the encoding/decoding facilities
have been integrated in an area of 7.66 mm^2 by a 0.35 µm triple-metal CMOS
technology, with a total dissipation of 146.60 mW from a single 3.3V supply.



2 H.263 VIDEO CODING ALGORITHM 

The main encoding/decoding process of H.263 is the so-called MC-DCT coding, as
shown in Figure 1, which is executed in the same manner as in H.261 and MPEG1.
The distinctive features of the H.263 standard lie in simple syntax, half-pel
prediction, block-level motion estimation (advanced prediction mode), paired
coding of the P-frame and the B-frame (i.e. the PB-frame mode), motion detection
outside the frame (i.e. the unrestricted vector mode), SAC (syntax-based
arithmetic coding mode), and so on.

As can be seen from Figure 1, the picture coding can be achieved with the use of
several functional units: [Motion Estimator (ME)], [Discrete Cosine Transformer
(DCT)], [Quantizer (Q)], and [Variable Length Coder (VLC)]/[Syntax-based
Arithmetic Coder (SAC)]. The bitstream decoding can be performed with the use of




Figure 2 Organization of H.263 codec core (blocks legible in the figure: 4M DRAM, DCT/IDCT, Q/IQ, VLC/VLD, SAC/SAD, bitstream)

Since this H.263 video codec core is intended for the single chip
implementation of a realtime H.324 audiovisual codec, innovations are needed
not only in reducing the area occupancy and the power dissipation but also
in reducing the external memory size and the memory access bandwidth. The
overall organization of our codec core is summarized in Figure 2, which is
composed of a number of specific functional units. The main factor in achieving a
high throughput at a low operation frequency is that I/O and processing
conflicts are mitigated at each stage of the codec.
The detailed architecture of each functional unit is outlined in what follows.

3.2 ME Core 

As to the so-called block-matching algorithm for ME (Motion Estimator), a number
of authors (Chen, 1995)(Hayashi, 1995)(Onoye, 1995)(Chan, 1995)(Kim, 1995)
have attempted to reduce the computational costs of the full search, which
detects a motion vector exhaustively within a search range by reference to MADs
(Mean Absolute Differences). A compact ME core (Fujita, 1997) has been attained
for H.263 by means of a sophisticated macroblock clustering algorithm (Onoye,
1996), which has the following features:

1. high quality vectors,

2. low computational costs, and

3. VLSI implementation capability.
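
For orientation, the full-search baseline mentioned above can be sketched as follows. This is a minimal illustration of exhaustive block matching by MAD, not the macroblock clustering algorithm used in the ME core; the function names, the 4 x 4 block size, the search range and the toy frames are all assumptions chosen to keep the example small.

```python
def mad(cur, ref, bx, by, dx, dy, n):
    """Mean absolute difference between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by the candidate vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)


def full_search(cur, ref, bx, by, n, search):
    """Test every vector in [-search, search]^2 that keeps the candidate
    block inside the reference frame; return (best_vector, best_mad)."""
    height, width = len(ref), len(ref[0])
    best_vec, best_cost = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
                continue
            cost = mad(cur, ref, bx, by, dx, dy, n)
            if cost < best_cost:
                best_vec, best_cost = (dx, dy), cost
    return best_vec, best_cost


# Toy data: the current frame is the reference shifted right by one pixel,
# so the matching block lies one pixel to the left in the reference.
ref = [[(3 * x + 7 * y) % 31 for x in range(12)] for y in range(12)]
cur = [[ref[y][max(x - 1, 0)] for x in range(12)] for y in range(12)]
vector, cost = full_search(cur, ref, bx=4, by=4, n=4, search=2)
```

The cost of this baseline grows with the square of the search range, which is the motivation for the reduced-cost algorithms cited above.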




Figure 3 Organization of ME core (legend: HP = Half Pel Calculator, PE = Processing Element, MB = Macroblock)

The organization of the ME core is illustrated in Figure 3. It consists of a
one-dimensional PE (Processing Element) array, an accumulator for calculating
macroblock vectors and block vectors, a macroblock buffer for the bi-directional
prediction, and a half-pel calculator.

Figure 4 indicates a block diagram of a PE, which adopts 8-bit and 12-bit
datapath circuits. The reference pixel is broadcast to all PEs, and the
prediction pixel is propagated from PE to PE. A PE outputs an MAD of 8
pixels every 8 cycles.
Figure 4 Block diagram of PE

In addition to the normal macroblock prediction, the H.263 standard supports the 
advanced prediction mode, which is to detect motions of four blocks in a 
macroblock. In other words, as outlined in Figure 5, a macroblock can have either 
one macroblock vector or four block vectors. To cope with this, the organization of 
the accumulator is devised as illustrated in Figure 6. The MADs for macroblock 
and four blocks are calculated simultaneously by accumulating 8 pixels' MADs 
output from the PEs. 
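
The simultaneous accumulation can be illustrated with a short sketch. It assumes that each PE result is the sum of absolute differences over 8 horizontally adjacent pixels and that results arrive indexed by macroblock row and left/right half; this indexing and the name `accumulate` are assumptions made for illustration, since the actual wiring is given only in Figure 6.

```python
def accumulate(partials):
    """partials[r][h] is the absolute-difference sum of the 8 pixels in row r
    (0..15), half h (0 = left, 1 = right) of a 16 x 16 macroblock.
    Returns (macroblock_total, [block0, block1, block2, block3])."""
    blocks = [0, 0, 0, 0]
    macroblock = 0
    for r, row in enumerate(partials):
        for h, value in enumerate(row):
            blocks[(r // 8) * 2 + h] += value   # one of the four 8 x 8 blocks
            macroblock += value                  # whole macroblock, same pass
    return macroblock, blocks


# Synthetic partial sums, one pair (left half, right half) per row.
partials = [[r + 1, 2 * (r + 1)] for r in range(16)]
mb_total, block_totals = accumulate(partials)
```

Because every partial sum is added to exactly one block total and to the macroblock total, a single pass over the PE outputs yields the costs for both prediction modes.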



Figure 5 Advanced prediction mode (panels: Normal Prediction, Advanced Prediction)



Figure 6 Block diagram of accumulator

The H.263 standard supports the PB-frame mode so that two frames (P-frame and
B-frame) can be coded as one unit, and the ME core seeks the vectors for a pair of
macroblocks of these two frames simultaneously. Unlike MPEG, the motion
estimation for the B-frame of H.263 requires the concurrent reference of two
frames, since only one vector per macroblock is used for the bi-directional
prediction, as illustrated in Figure 7. Therefore, the half-pel calculator
determines the average of the forward and backward reference pixel data, and
feeds it to the PE array. The former reference pixel data are read from the
external memory, and the latter from the macroblock buffer.



Figure 7 Bi-directional prediction of H.263 (panels: Forward, Backward)
3.3 DCT/IDCT Core



The DCT has been successfully employed so far in a variety of algorithms (ITU-T,
1993)(ITU-T, 1994)(ITU-T, 1993)(ISO/IEC, 1993) for image compression to
reduce the spatial redundancy of picture sequences.

The computational cost of the H.263 codec is lower than that of the MPEG1/2
cores. The DCT/IDCT architectures that have been developed for
MPEG1/2 (Matsui, 1994)(Uramoto, 1992)(Masaki, 1995) should not be employed
for H.263 from the viewpoint of hardware cost, and therefore in what follows a
novel specific architecture is proposed for H.263.

For implementing a DCT/IDCT core, Chen's algorithm (Chen, 1977) (butterfly
computation) is widely used in conjunction with distributed arithmetic (Peled,
1974). Chen's algorithm can reduce the number of multiplications in
DCT/IDCT by half. According to this algorithm, the 8 x 1 DCT and 8 x 1 IDCT
are calculated by means of the following equations.



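DCT:

\[
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}
\tag{1}
\]

\[
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
\tag{2}
\]

IDCT:

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\tag{3}
\]

\[
\begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\tag{4}
\]

where

\[
A = \cos\frac{\pi}{4},\quad B = \cos\frac{\pi}{8},\quad C = \sin\frac{\pi}{8},\quad
D = \cos\frac{\pi}{16},\quad E = \cos\frac{3\pi}{16},\quad F = \sin\frac{3\pi}{16},\quad G = \sin\frac{\pi}{16}.
\]

Figure 8 DCT/IDCT core by distributed arithmetic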



By means of the distributed arithmetic, as illustrated in Figure 8, each multiply
accumulation can be executed with the use of accumulators and ROMs that
contain tables of products calculated in advance. However, this scheme
additionally requires a bit slicer and a reorder buffer, and these units as well as
the ROMs occupy considerable area in the case of H.263. That is, this overhead
is a serious obstacle to implementing an H.263 DCT/IDCT core.
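
The principle of conventional distributed arithmetic can be sketched in a few lines. The example below is a generic bit-serial illustration under simplifying assumptions (unsigned inputs, small integer coefficients standing in for scaled DCT constants); the names `build_rom` and `da_inner_product` are hypothetical and do not describe the circuit of Figure 8.

```python
def build_rom(coeffs):
    """Table of all partial sums: entry `addr` holds the sum of coeffs[i]
    for every i whose bit is set in addr."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def da_inner_product(coeffs, inputs, bits):
    """Inner product sum(coeffs[i] * inputs[i]) for unsigned `bits`-wide
    inputs, evaluated one bit plane per step by table lookup and shift-add."""
    rom = build_rom(coeffs)
    acc = 0
    for b in range(bits):
        # Bit slicer: gather bit b of every input into one ROM address.
        addr = 0
        for i, value in enumerate(inputs):
            addr |= ((value >> b) & 1) << i
        acc += rom[addr] << b            # weight the lookup by 2^b
    return acc


coeffs = [3, -5, 7, 2]          # integer stand-ins for scaled constants
inputs = [10, 200, 33, 255]     # unsigned 8-bit samples
result = da_inner_product(coeffs, inputs, bits=8)
expected = sum(c * x for c, x in zip(coeffs, inputs))
```

The ROM here has one entry per combination of input bits, and the bit-slicing loop corresponds to the extra hardware whose area the text identifies as the overhead of this scheme.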

In contrast, as illustrated in Figure 9, our DCT/IDCT is devised specifically
for H.263: ROMs ROM A ~ ROM G and accumulators ACC 1 ~ ACC 4
calculate each multiplication of equations (1)-(4) (i.e. A(x0 + x7), A(x1 + x6), ...,
AX0, BX2, ..., etc.), 4 bits at a time. As a result, the reorder buffer can be
removed, and the number of ROMs can be reduced without degrading
performance.
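
A rough software analogue of multiplying by a fixed constant 4 bits at a time is given below, applied to the even DCT outputs X0, X2, X4 and X6. It is only a sketch: the 16-entry floating-point tables, the three-digit operand width and the function names are assumptions, and the result is checked against the DCT computed directly from its definition rather than against the actual hardware.

```python
import math

A = math.cos(math.pi / 4)
B = math.cos(math.pi / 8)
C = math.sin(math.pi / 8)

# One 16-entry table per constant: the constant times every 4-bit digit.
ROM = {name: [value * digit for digit in range(16)]
       for name, value in (("A", A), ("B", B), ("C", C))}


def rom_multiply(name, operand, digits=3):
    """Multiply a non-negative integer operand by the named constant,
    consuming the operand 4 bits per step (least significant digit first)."""
    acc = 0.0
    for step in range(digits):
        digit = (operand >> (4 * step)) & 0xF
        acc += ROM[name][digit] * (16 ** step)   # shift-and-accumulate
    return acc


def even_dct(x):
    """Even outputs X0, X2, X4, X6 of the 8 x 1 DCT via the butterfly sums."""
    s = [x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4]]
    rows = [
        [("A", 1), ("A", 1), ("A", 1), ("A", 1)],      # X0
        [("B", 1), ("C", 1), ("C", -1), ("B", -1)],    # X2
        [("A", 1), ("A", -1), ("A", -1), ("A", 1)],    # X4
        [("C", 1), ("B", -1), ("B", 1), ("C", -1)],    # X6
    ]
    return [0.5 * sum(sign * rom_multiply(name, v)
                      for (name, sign), v in zip(row, s))
            for row in rows]


def direct_dct(x, k):
    """Reference 8-point DCT coefficient computed from the definition."""
    ck = math.sqrt(0.5) if k == 0 else 1.0
    return 0.5 * ck * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                          for n in range(8))


pixels = [52, 55, 61, 66, 70, 61, 64, 73]
fast = even_dct(pixels)
reference = [direct_dct(pixels, k) for k in (0, 2, 4, 6)]
```

Each product is built from per-constant table lookups on successive 4-bit digits, so no bit-plane regrouping of the inputs is needed in this formulation.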
Figure 9 Block diagram of proposed DCT/IDCT core



3.4 Q/IQ Core 



The operations of Q and IQ are simply divisions and multiplications. To reduce the
area occupancy, our Q is implemented by means of a 2-bit sequential
non-restoring divider and IQ by means of a radix-4 sequential Booth multiplier
without parallel/array facilities, as illustrated in Figure 10. Both Q and IQ
can output a division/multiplication result every 4 cycles.
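
The multiplier side can be illustrated by a behavioural sketch of radix-4 Booth recoding, which retires two multiplier bits per step. An 8-bit multiplier is assumed here so that the loop takes four steps, in line with the 4-cycle figure quoted above; the operand width and the function name are assumptions, and the divider is not modelled.

```python
def booth_radix4(multiplicand, multiplier, bits=8):
    """Signed multiplication that consumes two multiplier bits per step.
    Returns (product, number_of_steps)."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1))
    m = multiplier & ((1 << bits) - 1)       # two's-complement bit pattern
    product = 0
    prev = 0                                 # implicit bit right of bit 0
    steps = 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1           # Booth digit in {-2,-1,0,1,2}
        product += (digit * multiplicand) << i
        prev = b1
        steps += 1
    return product, steps


product, steps = booth_radix4(13, -7)
```

Each step adds 0, plus or minus the multiplicand, or plus or minus twice the multiplicand, so only a shifter and one adder/subtractor are needed per cycle.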




Figure 10 Block diagram of Q/IQ core 

3.5 VLC/VLD and SAC/SAD Core

In addition to VLC, the H.263 standard supports SAC, which is based on arithmetic
operations and table index search. In order to achieve high coding efficiency, one
of these two encoding modes is chosen picture by picture.

Figure 11 shows a block diagram of the coding core. In both coding modes,
a set of run, level, and last is indicated by an index, and the index is then coded into
the bitstream. Thus, the index generator can be shared by the two
coding modes. As for the VLC table, the compression mechanism proposed by
(Tanaka, 1995) is employed to reduce the table size. The arithmetic unit performs
the arithmetic operations required for SAC and SAD.
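
The events that feed the shared index generator can be sketched as follows. The function below turns a scan-ordered list of quantized coefficients into (last, run, level) triples; the name `run_level_events` and the sample data are assumptions, and the mapping from a triple to its table index is defined by the standard's tables and is not reproduced here.

```python
def run_level_events(coeffs):
    """Convert scan-ordered quantized coefficients into (last, run, level)
    triples: `run` counts the zeros preceding a nonzero `level`, and `last`
    is 1 only for the final nonzero coefficient of the block."""
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    events = []
    prev = -1
    for idx in nonzero:
        last = 1 if idx == nonzero[-1] else 0
        events.append((last, idx - prev - 1, coeffs[idx]))
        prev = idx
    return events


events = run_level_events([5, 0, 0, -2, 1, 0, 0, 0])
```

Since both VLC and SAC start from the same triples, only the final stage (table lookup or arithmetic coding) differs between the two modes.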
Figure 11 Block diagram of VLC/VLD and SAC/SAD core



4 IMPLEMENTATION RESULTS 



The set of architectures outlined so far is implemented with the
ASIC design system COMPASS Design Tools ver. 9. This system allows the top-down
design of high-performance ASICs from a hybrid design entry of
circuit/datapath schematics and HDL. In this implementation, high-speed
datapaths are used in the ME and the DCT/IDCT cores, which require considerable
computation. The standard cell logic blocks for controllers and ROM tables are
synthesized from VHDL descriptions. It should be added that, since the
operation frequency is lowered to 15MHz, the total power dissipation is reduced
to 146.60 mW, and hence the core is suitable for portable use. Table 1 indicates the
main chip features of the codec core, and Figure 12 shows the layout pattern
obtained by a 0.35 µm triple-metal technology.



Table 1 Main chip features of codec core

Technology          0.35 µm CMOS, triple-level Al
Chip size           3.39 mm x 2.26 mm
Transistors         187,266
Clock frequency     15.0 MHz
Power dissipation   146.60 mW (3.3 V, 15.0 MHz)
Supported pictures  QCIF (176 x 144), sub-QCIF (128 x 96), 10 fps
Encoding options    advanced prediction mode, PB-frame mode,
                    unrestricted vector mode, syntax-based arithmetic
                    coding mode
Figure 12 Layout pattern of H.263 codec core (3.39 x 2.26 mm^2)



5 CONCLUSION 

This paper has outlined a set of VLSI architectures for an H.263
codec core dedicated to mobile computing. Specifically, the ME core can treat
various encoding options, and multipliers and dividers operating at a low
frequency are employed in the Q/IQ core and the VLC/VLD and SAC/SAD core.

Development is continuing on an integrated set of architectures for the single chip
implementation of H.324 audiovisual communication.
Abstract

Image recognition may consist of three main steps: edge detection, edge
linking, and object/template matching. Edge detection algorithms usually
produce thin edges with discontinuities. In this paper, a real-time algorithm
for linking broken edges and its VLSI implementation are presented. First, all
broken edge points inside a 12x12 moving window are identified. The 12x12
window scans the input gray level edge map, which is converted into three
intensity levels using two threshold values. Decisions on linking the stronger edge
break points are made based on their directions and/or with the guidance of
the weak edge lines. The proposed VLSI architecture is capable of running
the proposed edge linking algorithm in real time, outputting one pixel of the
linked edge map per clock cycle with a latency of 11n+12 clock cycles, where
n is the number of pixel columns in the image.



Keywords 

VLSI, Edge Linking, Real-Time Image Processing 



1 INTRODUCTION 

One of the most important features in an image for object recognition is edge
information (Marr, et al. 1980). Experiments in the past have shown that
the human visual system reacts strongly to sharp changes in pixel intensity in an
image. Robot systems and many other computer vision applications, as well,
rely on edge information to extract certain features within the
environment (William, et al. 1989).

An edge point, then, is a sudden change in the intensity level of an image
over a number of pixels in the horizontal and/or vertical direction
(Alzahrani, et al. 1997, Ungureanu, et al. 1993, Farag, et al. 1991, Farag,
et al. 1995, Xie 1992). However, edges are not all the same. Some of them are
much sharper, or stronger, than others, depending on the gray levels of
the adjacent areas.

Edge detection algorithms were developed in the past to produce high quality
edge maps. The focus of most of these algorithms is to ensure good quality
of the edge output regardless of the time consumed to produce it. Besides,
edge detectors based on the first derivative do not guarantee edge
maps with continuous edge contours and without unwanted branches (William, et al.
1989, Breen, et al. 1994, Bernand 1994, Diamond 1983). Edge detectors based
on the second derivative, such as zero-crossing, suffer from wrong edge detection in
textured and noisy images. Many object recognition algorithms prefer
closed contours of objects for comparisons and matching. Thus, the output
edge maps produced by the first type of edge detectors may not be useful in
follow-up processing steps.

In this paper, we propose a real-time edge linking algorithm and its VLSI
architecture, which is capable of producing binary edge maps at clock
frequencies of up to 28.5MHz. Our algorithm employs a moving window of size
12x12 pixels that scans the input images. For a VGA sized image, only 12
shift registers of 480 cells are needed to store 12 rows of the input image at a
time. There is no delay between the edge maps of successive input frames;
only a latency of 12x480 clock cycles is incurred.

2 EXISTING EDGE LINKING ALGORITHMS 

William and Shah (William, et al. 1989) proposed a Multiple Scale edge linking
algorithm in which all edge points produced by the Canny edge detector
(Canny 1986) are stacked in a queue. The output edge maps produced by the
Multiple Scale algorithm are of high quality and the edge contours are closed
and connected. However, the complexity of the algorithm is of the order of
3^n, where n is the number of image pixels. An edge linking algorithm
using a Causal Neighborhood Window was proposed by Xie (Xie 1992) to
achieve lower computational complexity. His algorithm performs poorly and
produces non-localized edge points when textured images are used. Farag and
Delp (Farag, et al. 1991) proposed a linear path metric function for a
sequential search process. Their algorithm involves fewer calculations
than the Multiple Scale algorithm and performs well for textured
images. However, some statistics such as σ and a priori information about
the processed images have to be known. Miller and Madeda (Miller, et al.
1993) proposed a different type of edge linking algorithm using a template based
method. This template based algorithm scans the image three times to produce
the final edge map.

Earlier attempts to map edge linking algorithms into VLSI included work
by (Ungureanu, et al. 1993), who used a programmable gate array chip of
ACTEL type to produce linked closed edge contours. Their algorithm involves
path search procedures to track the contour lines. The processed image pixels
should first be stored in an off-chip memory before the linking process starts.
An initial edge point should be chosen manually for the chip to start tracking
the contour lines from it. The time needed to process one image exceeds 30-50
ms for images of size 256x256.

Figure 1 (a) The Basic Structures (b) The direction representation and position.



3 THE EDGE LINKING ALGORITHM 



3.1 Basic Concepts 

In this paper, we use six basic structures, one of which every edge line takes
at any point. These basic structures form a complete set of edge line shapes, and
they are divided into connected basic structures and broken basic structures,
as shown in Figure 1(a). Edge lines can be decomposed into a number of basic
structures in many different ways. In general, we define a break point as an
edge point from which it is impossible to extract any connected basic structure.

Once a break point at some position in an edge map is identified, its direction
must also be determined. We define the eight directions by using complex
number notation as shown in Figure 1(b). The real part of a break point
direction represents the horizontal shift that the next linking edge point should
take, and the imaginary part represents its vertical shift. A break point
at position (x, y) is said to have a direction a + jb, where a, b ∈ {-1, 0, +1}, if it
is connected to a previous edge point located at the position (x - a, y - b).
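
The direction definition translates directly into a few lines of code. The sketch below uses Python's built-in complex type; the function names `break_direction` and `next_link_position` are assumptions introduced for illustration.

```python
def break_direction(point, previous):
    """Direction a + jb of a break point at (x, y) whose previous edge
    point lies at (x - a, y - b), with a, b in {-1, 0, +1}."""
    (x, y), (px, py) = point, previous
    a, b = x - px, y - py
    if (a, b) == (0, 0) or abs(a) > 1 or abs(b) > 1:
        raise ValueError("previous point must be one of the 8 neighbours")
    return complex(a, b)


def next_link_position(point, direction):
    """Position the next linking edge point takes: a horizontal shift by the
    real part and a vertical shift by the imaginary part."""
    x, y = point
    return (x + int(direction.real), y + int(direction.imag))


d = break_direction((5, 5), (4, 6))
nxt = next_link_position((5, 5), d)
```

With this convention the edge is continued along the same line on which it arrived at the break point.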

To guide the edge linking process effectively, some weak edge points are
preserved during the intensity edge map conversion. Thus, two different threshold
values are used: the larger value is used to produce only the wanted
edge points without any branches, and the smaller value is used to
preserve some weak edge points. The first threshold value is referred to as the
strong threshold, and the other as the weak threshold.
The edge map resulting from the use of the two thresholds is referred to as the
trinary edge map.



3.2 The Algorithm 

When using two threshold values to produce the trinary edge map, finding the
break points and determining their directions are applied only to the strong
edge map. Unlike some existing edge linking algorithms, which tend to operate
on a large area of an image and thus require a large amount of memory, our
approach takes a small portion of an image at a time, utilizing the weak edge
points as guides to maintain accuracy while requiring a much smaller amount of
memory. This small portion is referred to as the window. The entire image is then
scanned by the window once to complete the edge linking process. Assuming
that the image pixels are fetched in row order, from top left to bottom right,
the scanning pattern follows the same direction. Generally speaking, the choice
of window size depends on the requirement for edge linking accuracy, the
amount of memory, and other hardware related constraints, bearing in mind
that our goal is to map the algorithm onto a VLSI chip. The tradeoff resulted
in the choice of a 10x10 window that also needs information from the
surrounding pixels. Thus, a window size of 12x12 is chosen.

Once the window is moved to a new location, all break points inside the
window are counted and their directions are determined. In order to
start linking break points, the window itself is divided into four sub-windows;
each sub-window has a size of 5x5 pixels taken from the core area. The linking
process starts first within each sub-window. Once this process is completed,
another linking process between sub-windows starts. The edge linking
algorithm consists of the following steps:

Step 1 Intra-Column Linking 

For each column in each sub-window, there are three scenarios with regard
to break points. First, if there are no break points in a column, no action is
required and the process moves to the next step. Second, if there is one break
point in a column, its presence is recorded and the process moves to the next
step. Third, if there are two break points in a column, the directions of these
break points are checked using the following condition:

\[
\mathrm{Link} = (P \cdot d_1 < \cos\theta) \wedge (-P \cdot d_2 < \cos\theta) \wedge (-d_1 \cdot d_2 < \cos\theta), \tag{1}
\]

where P = (x_2 - x_1) + j(y_2 - y_1) is the position vector; d_i is the direction of
break point i; (x_i, y_i) is the location of break point i; (·) denotes the
vector dot product; and θ is the maximum allowed directional difference between
two break points, which is set to 45°.
Step 2 Inter-Column Linking

In this step, an attempt is made to link one pair of the unlinked break points
from step 1 across different columns using the same direction matching criterion.
The pair selected in this step is the pair with the farthest horizontal distance.
Choosing the farthest points instead of the closest was based on experimental
observations, which indicate better linking among broken edges. If the selected
pair cannot be linked, an attempt is made to extend one of the break points with
the guidance of nearby weak edge points, if possible. The other break point in the
pair is passed on to the next step for further processing. All the remaining
unlinked break points are tried again when the window is moved to the
next location.

Step 3 Inter-Sub-Window Linking 

During this step, each sub-window may have one break point that has to be
linked with the others. Thus, a maximum of 4 break points are to be considered
within the entire window. There are 6 possible combinations of pairing any
two break points out of the possible 4. Linking is attempted on each pair using
the same direction matching criterion. After linking the break points, an
attempt is made to extend any remaining unlinked break point using nearby weak
edge points as guidance.

Step 4 Changing Window Location 

In this step, the window is moved to the next location and steps 1 to 3 are 
repeated. 
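
The pair selection used in steps 2 and 3 can be sketched as follows. The code covers only the choice of candidate pairs (the farthest horizontal pair for step 2, all pairs for step 3), not the direction matching test itself; the function names and the sample coordinates are assumptions.

```python
from itertools import combinations


def farthest_pair(points):
    """Step 2 selection: the two break points with the largest horizontal
    separation, or None if fewer than two points remain."""
    if len(points) < 2:
        return None
    return max(combinations(points, 2),
               key=lambda pair: abs(pair[0][0] - pair[1][0]))


def candidate_pairs(points):
    """Step 3 selection: every pairing of the break points handed over by
    the sub-windows (at most 4 points, hence at most 6 pairs)."""
    return list(combinations(points, 2))


# One leftover break point (x, y) from each of the four sub-windows.
leftovers = [(0, 1), (2, 3), (4, 0), (1, 4)]
pair = farthest_pair(leftovers)
pairs = candidate_pairs(leftovers)
```

For four points the enumeration yields the six combinations mentioned in step 3.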



4 EXPERIMENTAL RESULTS 

The Lenna image was used to determine the effectiveness of the proposed edge
linking algorithm. Applying the ADM edge detection algorithm (Alzahrani,
et al. 1997) to the Lenna image produces the intensity edge map shown in
Figure 2(a). This intensity edge map is the input to our edge linking algorithm.
The trinary edge map produced by the two-threshold process is shown in
Figure 2(b). The edge linking algorithm outputs the final linked edge map
shown in Figure 2(c).

5 VLSI IMPLEMENTATION OF THE EDGE LINKING 
ALGORITHM 

In order to allow real-time edge linking, our edge linking algorithm was
mapped into hardware. The algorithm was first translated into a structural
level design using VHDL and verified before gate level design took place. The
edge linking circuit is organized in three major functional blocks: the Receiver
block, the Loop block, and the Mask block. Figure 3(a) is the top level block
diagram, which shows the signal flow of the image coming out of the edge
detection chip and passing through the proposed edge linking units until the
final edge map is produced.

Figure 2 (a) The intensity edge map of ADM algorithm (b) The trinary
edge map by the proposed algorithm (c) The final output edge map.

Figure 3 (a) The overall circuit diagram (b) The Receiver block diagram.

The output signal from the ADM edge detection chip (Alzahrani, et al.
1997) is a serial flow of the intensity edge map pixels on an 8-bit bus. This
8-bit signal is mapped into a 2-bit signal using two threshold values. Figure
3(b) shows the circuit design for the Receiver block. Two 8-bit comparators
are contained in this block to produce two binary signals D_s and D_w such
that D_s = (IN > T_s) and D_w = (IN > T_w), where IN is the 8-bit input signal and
T_s, T_w are the strong and weak threshold values, respectively.
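
A behavioural sketch of the Receiver block follows. The two comparisons mirror the comparator outputs described above; the threshold values 120 and 60 and the 0/1/2 encoding returned by `trinary_level` are assumptions made purely for illustration.

```python
def receiver(pixel, t_strong, t_weak):
    """Map one 8-bit edge intensity to the two comparator outputs
    (strong bit, weak bit)."""
    d_s = int(pixel > t_strong)
    d_w = int(pixel > t_weak)
    return d_s, d_w


def trinary_level(pixel, t_strong, t_weak):
    """Assumed encoding of the three levels: 0 = no edge, 1 = weak edge
    only, 2 = strong edge."""
    d_s, d_w = receiver(pixel, t_strong, t_weak)
    return d_s + d_w


samples = (10, 60, 61, 120, 121, 255)
bits = [receiver(p, t_strong=120, t_weak=60) for p in samples]
levels = [trinary_level(p, 120, 60) for p in samples]
```

Because the strong threshold is the larger one, a set strong bit always implies a set weak bit, so only three of the four 2-bit combinations occur.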

The main function of the Loop block is to prepare the image data for the
Mask block to produce the final output edge map. At the beginning of the linking
process, the data of a 12x12 block from the top-left corner of the image must
first be ready. Therefore, the first twelve rows of the image need to be stored in
shift registers. Once the data of the first 12x12 block are ready in the Mask
block, the linking process starts. Figure 4 shows the signal flow inside the
Loop and the Mask blocks. Notice that we divided the Loop into three pieces
to illustrate the circular movement of the signal. The Loop block contains
468x12 storage cells. Each cell has two D flip-flops: one to pass the strong
edge points map and the other to pass the weak edge points map.
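
The storage figures can be cross-checked with a little arithmetic. The sketch below assumes that each of the 12x12 Mask positions also holds one strong bit and one weak bit; under that assumption the totals agree with the 480-cell row length mentioned in the introduction and with the 11,520 flip-flops quoted in the complexity analysis.

```python
LOOP_COLUMNS = 468        # Loop block cells per row (from the text)
MASK_COLUMNS = 12         # Mask window width (from the text)
ROWS = 12                 # rows held at a time (from the text)
FLIPFLOPS_PER_CELL = 2    # one strong bit and one weak bit per cell

row_length = LOOP_COLUMNS + MASK_COLUMNS
loop_flipflops = LOOP_COLUMNS * ROWS * FLIPFLOPS_PER_CELL
# Assumption: every Mask position stores the same two bits as a Loop cell.
mask_flipflops = MASK_COLUMNS * ROWS * FLIPFLOPS_PER_CELL
total_flipflops = loop_flipflops + mask_flipflops
```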
Figure 4 The signal flow in the edge linking circuits.



The Mask block contains 10x10 internal Cell units to accommodate the
core part of the moving window, as described in the edge linking algorithm
section. Figure 5 shows the arrangement of the units inside the Mask block.
There are 44 Coat units located around the core area, making the Mask window
12x12 in size. The Coat cells are used to provide boundary information to
the 10x10 mask; no edge linking process is applied to them. The Mask block
itself is divided internally into 4 sub-windows, each having a size of 6x6 pixels.
The communication between the Cell units is conducted via Net units. Within
each sub-window, the Controller5 unit is responsible for the Intra-Column and
Inter-Column linking processes. Links between the sub-windows are controlled
by the Controller10 unit. In general, we can list the functionality of the Mask
as follows:



• The Cell units identify all break points and determine their directions.
Figure 6 shows the schematic diagram of a Cell unit.

• Through the Net units, the signals from each Cell unit in a sub-window
are sent to the Controller5 in order to process the Intra-Column links and
the Inter-Column links.

• Each Controller5 selects a pair of break points, if any, from each column
using five MuxC units. Figure 7(a) shows the logic structure of the MuxC
unit. The MUX5x2 selects the two break points with the longest distance and
passes them to the follow-up logic for further identification.

• The pair from each column is linked after the directions of its break points
are checked. Five Match5x1 units are used for this purpose, one for each
column. Figure 7(b) shows the design of a Match5x1 unit.

• The break point information from each column is then passed to the
MuxR unit to perform step 2 of the algorithm (Inter-Column linking). The
way MuxR works is very similar to that of MuxC in the sense that it chooses
two break points from among five columns. It then outputs the information
of the two selected break points from the two farthest possible columns to the
Match5x2 unit.

Figure 5 The Mask block diagram.

Figure 6 The Cell unit.

• The Match5x2 unit checks the direction agreement of the two input break 
points coming from the MuxR unit. Similarly, it passes the location of the 
two break points along with the decision signal to link them. 

• In the Link5x2 unit, two stages are needed to draw a line between the two 
break points if the linking decision is high. In each stage, two Step5x5 units 
are used to fill the gap between the two break points. The output from the 
two Step5x5 units in the first stage is the X-Y location of two new points 



A VLSI architecture for real time edge linking 



Figure 7 (a) The MuxC unit (b) The Match5x2 unit (c) The Link5x2 block
diagram. (Labels legible in panel (c): inputs X1(4:0), X2(4:0), Y1(4:0), Y2(4:0); first-stage outputs X1+, X2+, Y1+, Y2+; second-stage outputs X1++, X2++, Y1++, Y2++; Step5x5 units.)



such that the gap between them is shorter than the original gap of the 
break points. Another stage of a similar process is then enough to connect 
the break point by a linking line. The block diagram for the Link5x2 unit 
is shown in Figure 7(c). 

• If the linking decision signal from the Match5x2 unit is low (decision for
not linking), one of the selected break points is fed to a Weak unit.
The function of the Weak unit is to extend the input break point one or
two pixels in the break direction if there are weak edge points nearby.
Extending break points along their directions on a weak
edge path shortens some long gaps between break points. This way,
the linking process inside the sub-window becomes effective and sufficient.

• All the functional units mentioned above constitute the Controller5 unit,
as shown in Figure 8.

• The other unlinked break point left from the Match5x2 unit is sent to
the Controller10 unit, which receives up to four possible break points from
the four sub-windows. Six possible comparisons are conducted to draw
linking lines at the Inter-Sub-Window linking stage.
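
The count of six comparisons follows from pairing up to four candidate break points, one per sub-window: C(4,2) = 6. The short Python sketch below only illustrates that count; the sub-window labels are hypothetical and not taken from the paper.

```python
from itertools import combinations

# Hypothetical labels for the four sub-windows, each of which may hand
# one unlinked break point to the inter-sub-window controller.
sub_windows = ["SW0", "SW1", "SW2", "SW3"]

# Every unordered pair of sub-windows is one candidate comparison.
candidate_pairs = list(combinations(sub_windows, 2))

print(len(candidate_pairs), "candidate comparisons")
for first, second in candidate_pairs:
    print(first, "<->", second)
```

When fewer than four sub-windows supply a break point, the same enumeration over the available ones gives the smaller number of comparisons.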

6 COMPLEXITY ANALYSIS, CHIP SIZE ESTIMATIONS 

The entire edge linking circuit consists of about 10,000 logic gates and 11,520
flip-flops. The chip area was estimated using a 0.8 µm CMOS
standard cell library. Adding up the areas of all the gates used in the circuit
gives an overall gate area of approximately 10 mm². The
routing area is assumed to be the same as the gate area needed on the
chip. Thus, the total chip area is about 20 mm².

Figure 8 The Controller5 block diagram.

A critical path measurement was also carried out to determine the potential
performance of the edge linking chip. Tracing the circuit,
we find that the critical path consists of 35 cascaded gates. Figure 9 shows the
critical path, indicating the delay of each unit at its output. The maximum
fanout among the gates on the critical path is 4. Assuming a conservative
average gate delay of 1 ns in the 0.8 µm CMOS technology, the minimum
clock cycle time is 35 ns. Thus, the maximum allowable clock frequency is
about 28.5 MHz. Since the chip outputs one pixel of the linked edge map per
clock cycle, this is 3 times the video rate requirement
for VGA-sized images, or 1.2 times the video rate requirement for SVGA-sized
images.
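
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the frame formats (640x480 for VGA, 1024x768 for SVGA, both at 30 frames per second) are assumptions, since the paper does not state them; they are chosen because they reproduce the quoted margins of about 3 and 1.2.

```python
GATE_DELAY_NS = 1.0        # conservative average gate delay quoted in the text
CRITICAL_PATH_GATES = 35   # cascaded gates on the critical path

cycle_ns = GATE_DELAY_NS * CRITICAL_PATH_GATES
f_max_hz = 1e9 / cycle_ns  # one output pixel per clock cycle


def pixel_rate(width, height, frames_per_second=30):
    """Pixels per second needed to keep up with a video stream."""
    return width * height * frames_per_second


# Assumed frame formats (not given in the paper).
vga_margin = f_max_hz / pixel_rate(640, 480)
svga_margin = f_max_hz / pixel_rate(1024, 768)

print(f"max clock: {f_max_hz / 1e6:.2f} MHz")
print(f"VGA margin: {vga_margin:.2f}x, SVGA margin: {svga_margin:.2f}x")
```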



7 SIMULATION RESULTS OF THE VLSI ARCHITECTURE 

The Lenna image was processed by the circuit through a logic simulator as 
well as a structural level VHDL simulator. Both the schematic logic circuit 
and the VHDL code are tested using the Lenna image. Figure 10 shows the 
output edge map of the Lenna image from the simulated VLSI circuit and 
from the simulated VHDL code. The threshold values used for the Lenna 
image were Tg = 40 and = 20. 
Figure 9 The critical path.




Figure 10 (a) The VHDL code output (b) The logic gate circuits output. 



8 CONCLUSION 

In this paper, we have presented a new algorithm for edge linking and its
VLSI implementation. The software implementation, written in C++, was
shown to give good output results through four major steps of edge linking:
the Intra-Column Linking step, the Inter-Column Linking step, the
Inter-Sub-window Linking step, and the Changing Window Location step.

Although our algorithm does not produce better results than other existing
edge linking algorithms, its hardware implementation has been
developed and shown to produce output edge maps in a real-time processing
environment. In implementing the VLSI circuits, we tried to make the design
as simple as possible in order to obtain a smaller die size. The strategy used while
building the circuits at the gate level was to have symmetric designs with less
inter-metal wiring between blocks. The proposed edge linking circuit is able to
process 33 VGA/SVGA frames per second in normal operation mode with
a clock frequency of up to 28.5 MHz.

The proposed edge linking circuit has two major limitations. First, the
output edge maps are not guaranteed to have closed contours. The
same may hold for other well-known edge linking algorithms, but we still
consider it a problem. Second, the edge linking process is sensitive
to the choice of the strong threshold values. That is, fixing a single
permanent threshold value in the edge linking chip for all possible
input images would result in poor edge linking.

One possible solution to the threshold problem is to build a feedback controller
circuit that adjusts the threshold values. A feedback signal
from an image matching circuit would give the matching percentage of the detected
image against a known template image, and the control circuit would then increase or
decrease the threshold value correspondingly.
Abstract

In this work a parallel architecture is proposed for VLSI implementation of a
data-flow algorithm for 2D boundary (or contour) detection. The algorithm works on the
gradient image and uses a set of primitive paths to generate all possible contour 
paths on a neighborhood defined by a 5x5 window. The objective is to determine 
whether or not the neighborhood central pixel belongs to a continuous boundary 
line passing across the window. Test results show that one-pixel wide continuous 
boundary lines can be extracted using this algorithm. 
1 INTRODUCTION

Polygonal modeling has traditionally been one of the most popular methods for 
shape representation and still is largely used in applications where shape 
recognition of two-dimensional objects or surfaces is needed [1,2]. 

Currently a lot of attention is paid to image sequence and video coding due to the
increasing importance of high speed multimedia applications [1-3]. In fact,
model-based image coding is a powerful technique for compressing head-and-shoulder
images, as in the MPEG-4 sequences that impose very low bit-rate coding. In image
coding, compression ratio and reconstruction quality are two issues of great
importance. The use of adaptive coding based on appropriate image models is one
of the most effective approaches to achieve these goals simultaneously. So,
polygonal modeling is an important tool for these intelligent coding schemes,
where the coder searches each image for objects which are identified according to
some underlying object models [4]. Large compression ratios result since, once an
object is identified, it can be tracked through a number of frames in a sequence and
only subsequent changes in the model parameters (shape, motion, etc.) need to be
transmitted.

Polygonal models have the advantage of being a local representation, i.e., they 
preserve local shape features therefore allowing for object recognition, even in the 
presence of partial occlusion of objects. Additionally, they can be made insensitive 
to rotation, translation, and scaling, (a requirement for any practical recognition 
system) and are much less computationally expensive than higher-order 
polynomial approximations. As a drawback, the representation provided by 
polygonal models is usually not as compact as those based on global features such 
as moments or transform descriptors. A more complete analysis of the issues 
involved in those and other forms of shape representation has been made by 
Pavlidis [10]. Further details about the advantages of polygonal modeling can also 
be found in the literature [8,9]. 

In order to operate properly, algorithms for polygonal modeling of 2D objects 
require shapes with continuous and well defined boundaries. Generation of this 
boundary is the objective of the preprocessing phase to which the original image of 
the object to be modeled is normally submitted. As part of a typical preprocessing 
operation initially a discrete gradient operator is employed to generate a gradient 
image, upon which boundary tracking segmentation can be performed [10]. 

Most of the work so far reported on algorithms for boundary extraction on digital
images assumes a sequential software implementation, either on a general purpose
computer or on a specialized signal processor. This type of approach is
inadequate if real-time high speed operation is desired, due to the computationally
intensive character of low-level image processing. Applications such as vision
systems for mobile robots or video compression may require 512x512 image
frames with 256 gray levels (8 bits) to be processed at a rate of 30 frames per
second. Real-time operation at this rate would require a processing time of about 120
nanoseconds for each image element or pixel. This performance figure is
clearly out of reach for any sequential general purpose computer when one
considers the number of multiplications, additions and other operations usually
involved in each output pixel computation. The obvious solution is to design
parallel algorithms and architectures which can be implemented in specialized
integrated circuits called ASICs using CAD-based VLSI design tools.

The availability of very powerful and easy to use VLSI design tools has fueled the
development of several real-time image processing systems, particularly for
low-level feature detection and extraction applications. Bhanu et al. have designed and
implemented a real-time segmentation processor which makes use of a gradient 
relaxation algorithm (iterative) to assign pixels into classes, based on their gray 
value and the gray values of neighboring pixels [11]. Ranganathan et al. have 
proposed a VLSI architecture which convolves images with eight 15x15 kernels in 
order to implement a technique for corner detection which is based on the concept 
of half-edge and on the first derivative of Gaussian [12]. Cheng et al. utilized the 
theory of dynamic programming to develop a backtracking method for curve 
detection and designed an associated VLSI architecture which solves the problem 
in O(n) time, where n is the length of the curve to be found [13]. All these
architectures make use of pipelining and parallelism in order to achieve real-time 
performance. 

In this work a parallel architecture is proposed for VLSI implementation of a
data-flow algorithm for 2D boundary (or contour) detection. The algorithm works on
the gradient image and uses a set of primitive paths to generate all possible contour 
paths on a neighborhood defined by a 5x5 window. The objective is to determine 
whether or not the neighborhood central pixel belongs to a continuous boundary 
line passing across the window. 

The rest of the work is organized as follows: Section 2 describes the algorithm for 
boundary detection. Section 3 shows results obtained by simulating actual circuit 
operation. Section 4 estimates hardware implementation cost. Finally, conclusions 
are drawn in section 5. 



2 ALGORITHM 

For hardware implementation of image processing, the most performant algorithms 
are those of the data-flow type. The image pixels come in serially, pixel by pixel, 
one line after the other. For every incoming image pixel, a data-flow algorithm 
produces one output pixel. The output pixel is determined by a function of the 
corresponding input pixel and neighboring pixels. These pixels form a window. 
Apart from the function, the performance and the data storage requirements of 
data-flow algorithms depend on the size of the window, as shown in table 1. The 
storage size and delay times are for images with 8 bits per pixel, 512 pixels per 
line, 512 lines per image, 30 images per second. 

The performance characteristics of data-flow algorithms make them the best
choice for real time image processing.

The boundary detection algorithm proposed in this paper consists of the first three 
processing stages shown in figure 1. Each stage will be described in detail in the 
following paragraphs. 

Table 1 Window size and performance of data-flow algorithms

Window             Storage space                   Image
(pixel x pixel)    required (bits)                 delay (µs)

2x2                1 line + 2 pixels = 4112        1 pixel = 0.1
3x3                2 lines + 3 pixels = 8216       1 line + 2 pixels = 61.7
5x5                4 lines + 5 pixels = 16424      2 lines + 4 pixels = 123.3
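
The entries of table 1 can be reproduced from the image format stated in the text. The Python sketch below is illustrative only: it assumes the delays were computed with the rounded pixel time of 0.12 µs quoted in the introduction (the exact value for 512x512 pixels at 30 images per second is about 127 ns), and the function names are hypothetical.

```python
BITS_PER_PIXEL = 8
PIXELS_PER_LINE = 512
LINES_PER_IMAGE = 512
IMAGES_PER_SECOND = 30

# Exact time available per pixel, in nanoseconds.
exact_pixel_time_ns = 1e9 / (PIXELS_PER_LINE * LINES_PER_IMAGE * IMAGES_PER_SECOND)

# Rounded pixel time (assumption: this is the value behind the table).
PIXEL_TIME_US = 0.12

# Image delay per window size as (lines, pixels), copied from table 1.
DELAY_TERMS = {2: (0, 1), 3: (1, 2), 5: (2, 4)}


def storage_bits(n):
    """Storage for an n x n window: (n - 1) full lines plus n pixels."""
    return ((n - 1) * PIXELS_PER_LINE + n) * BITS_PER_PIXEL


def delay_us(n):
    """Image delay in microseconds for an n x n window."""
    lines, pixels = DELAY_TERMS[n]
    return (lines * PIXELS_PER_LINE + pixels) * PIXEL_TIME_US


for n in (2, 3, 5):
    print(f"{n}x{n}: {storage_bits(n)} bits, {delay_us(n):.2f} us")
```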




Figure 1 The image processing stages of the proposed algorithm: original image,
gradient image, first border image, second border image, vertex points with
Hough value, filtered vertex points.

2.1 Spline Gradient 

In a 5x5 matrix we define four masks for the Spline gradient based on the
one-dimensional spline coefficients obtained as described in [14]. The two-dimensional
spline gradient $g_p$ is obtained from the horizontal, vertical, and two diagonal gradients
$g_{px}$, $g_{py}$, $g_{p1}$, and $g_{p2}$, respectively:

$g_p = 3(|g_{px}| + |g_{py}|) + 2(|g_{p1}| + |g_{p2}|)$   (1)

One of the main advantages of this gradient operator is that it is less sensitive to 
noise. This comes from the fact that the Spline gradient uses a 5x5 window and 
determines its result from the mean value of horizontal, vertical, and diagonal 
derivatives. Apart from that the boundaries are sharper, making the gradient values 
of the contour pixels significantly higher than those in the neighborhood. That 
happens because the spline interpolation best approximates the discrete points to 
the original function. 
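
As a small illustration, the combination step can be written as below. This is a sketch that assumes the combination rule reads g_p = 3(|g_px| + |g_py|) + 2(|g_p1| + |g_p2|); the four directional gradients are taken as already computed, because the 5x5 spline masks themselves come from [14] and are not reproduced here. The function name is hypothetical.

```python
def spline_gradient(g_px, g_py, g_p1, g_p2):
    """Combine the four directional gradients into one gradient value.

    The horizontal and vertical terms are weighted by 3 and the two
    diagonal terms by 2 (assumed reading of the combination rule).
    """
    return 3 * (abs(g_px) + abs(g_py)) + 2 * (abs(g_p1) + abs(g_p2))


# The sign of each directional gradient does not matter.
print(spline_gradient(1, -2, 3, -4))
```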
2.2 Local Maximum and Pathfinder I



This phase works with two criteria. If both criteria are validated, the center pixel of 
the window becomes the output pixel, otherwise the output pixel becomes zero. 
Note that the output image at this phase is not a binary image. It has as many gray 
levels as the input image. 

The local maximum criterion is based on a 5x5 window. A simple criterion would
be that if the center pixel is among the five biggest values in the window it
is chosen as a contour pixel [15]. However, this condition is insufficient
because when many pixels have the same gray level value the center
pixel has its significance lowered. Therefore, a weight is
assigned to the center pixel in order to help make the decision. This weight is
defined as:

$$W = \begin{cases} 0, & \text{if } N_s = 0 \\ 2N_s + N_e + p_c, & \text{otherwise} \end{cases} \qquad (2)$$

where W = weight of the center pixel;
$N_s$ = number of pixels smaller than the center pixel;
$N_e$ = number of pixels equal to the center pixel;
$p_c$ = value of the center pixel ($0 < p_c < 32$).

The multiplication of $N_s$ by 2 is used to emphasize the importance of this term in
comparison to $N_e$.

The local maximum criterion is then validated if the center pixel value is greater 
than 13% of the full scale value and if its weight is bigger than a threshold value, 
which was found empirically to be 30. 
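
A minimal Python sketch of this criterion follows. It rests on two assumptions that should be checked against the original paper: the weight is taken to have the form W = 2·N_s + N_e + p_c (zero when N_s = 0), as the surrounding definitions suggest, and the full scale value is taken as 31 for 5-bit pixels. All names are hypothetical.

```python
FULL_SCALE = 31          # assumed full scale value for 32 gray levels
LEVEL_FRACTION = 0.13    # center pixel must exceed 13% of full scale
WEIGHT_THRESHOLD = 30    # empirical threshold quoted in the text


def center_weight(window):
    """Weight of the center pixel of a 5x5 window (list of 5 rows)."""
    p_c = window[2][2]
    neighbors = [window[r][c] for r in range(5) for c in range(5)
                 if (r, c) != (2, 2)]
    n_smaller = sum(1 for value in neighbors if value < p_c)
    n_equal = sum(1 for value in neighbors if value == p_c)
    if n_smaller == 0:
        return 0
    # Assumed form of the weight: the "smaller" count is doubled.
    return 2 * n_smaller + n_equal + p_c


def local_maximum(window):
    """True if the center pixel passes the local maximum criterion."""
    p_c = window[2][2]
    return (p_c > LEVEL_FRACTION * FULL_SCALE
            and center_weight(window) > WEIGHT_THRESHOLD)


flat = [[10] * 5 for _ in range(5)]   # no pixel is smaller than the center
peak = [[5] * 5 for _ in range(5)]
peak[2][2] = 20                       # isolated bright center pixel
print(local_maximum(flat), local_maximum(peak))
```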

The pathfinder criterion also works on a 5x5 window. This criterion determines 
whether the center pixel of the window belongs to a continuous border line passing 
across the window. To accomplish this, all possible border paths across a 3x3 
window, shown in figure 2a inside the squares formed by the pixels in the 
positions 1, together with their possible continuations, represented by the positions 
2 and 3, must be checked. The possible paths are derived from the primitives in 
figure 2a by mirror and rotation operations. Eliminating repetitive patterns, a total
of 44 distinct paths was obtained. Note that the paths in figure 2a do not contain
any sharp corners, which would make subsequent border tracking difficult.
Figure 2 Primitive paths for pathfinder I (a) and pathfinder II (b)



The pathfinder criterion is validated if the center pixel belongs to two of the eight
paths with the largest values. The value of a path is determined by the weighted
sum of its pixel values:

$$W_s = \begin{cases} M_2 + M_3 + \dfrac{12}{N_{p1}} \sum_{i=1}^{N_{p1}} p_i, & \text{if condition} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $W_s$ = weighted sum;
$N_{p1}$ = number of positions 1 in the 3x3 path;
$p_i$ = gray level value of the i-th pixel in position 1;
$M_n$ = maximum value among the pixels in position n;
condition = all $p_i$ and $M_n$ must be bigger than 13% of the full gray scale value.

Note that $W_s$ is always an integer value.



2.3 Pathfinder II 



The pathfinder II algorithm uses only paths containing at least 3 pixels in the 
positions 1. These 28 paths were derived from the primitives in figure 2b applying 
the same technique used in pathfinder I. 

The same equation (3) is used to calculate the weighted sum for each path. Then
the two highest results, called here $W_1$ and $W_2$ and associated with the masks $m_1$ and $m_2$,
respectively, are determined. If the highest sum, always assumed to be $W_1$, is equal
to zero then the center pixel of the window necessarily does not belong to a path,
and the output pixel is set to 0.

A second condition to be analyzed is when $W_1$ and $W_2$ have an equal value and $m_1$
and $m_2$ are adjacent (adjacent paths are those which differ from one another
by just one pixel). This situation is considered a stalemate, because the two
differing pixels could equally be labeled contour pixels. This
stalemate is resolved with simple logic. If there is a vertical stalemate, this logic
keeps the pixel on top and discards the other one. If the stalemate is horizontal, the
pixel on the left is kept and the one on the right is discarded.

In the absence of a stalemate, if $m_1$ contains a position 1 in its center the output is set
to 1; otherwise it is set to 0.

This criterion assures that only one of the pixels of the different paths is set to 1, 
providing a one pixel wide border line. In the examples shown in section 3, the 
stalemate situation occurs at about 4% of the border pixels. 



3 SIMULATION RESULTS



Operation of the border detection hardware implementation was simulated by a 
program written in C language. The choice of C is justified by the fact that it 
allows fast simulations at functional level providing fast turn-around time for 
debugging and parameter adjustment. A hardware description language like 
Verilog or VHDL runs slower and is more complicated to debug. 

Images of two objects were used to validate the algorithm: an arch-shaped toy
block (figure 3) and a screwdriver (figure 4).

The toy block image is made of 64x64 pixels. 32 gray levels (5 bits) are used.

Figure 3a shows a shape with sharp contrast. The gradient image in figure 3b reaches
strong gray level values all around the border and the contour line is relatively
thin. This makes the task of border extraction rather easy, resulting in a perfectly
continuous line no more than one pixel wide in figure 3d.




Figure 3 Original image of toy block (a), gradient image (b), first border (c) and
second border (d).

The screwdriver image is made of 220x128 pixels. 32 gray levels (5 bits) are used.

Figure 4 Original image of the screwdriver (a) and second border (b). 

In comparison to the toy block, the screwdriver in figure 4a has poor contrast,
mainly due to its round shape at perpendicular sections of the image plane and
because of the shadow caused by light that falls onto the image plane at an angle from
the left side. To avoid the shadow, the light source should be located next to the
camera. The contrast is particularly bad at the tip of the screwdriver and at the top
of the handle. Even so, the results obtained after the image was run through
the algorithm were very good, except for a two pixel wide border point located in
the upper bottom corner of the handle, zoomed in figure 4b.

Still, the quality of the contour in figure 4b is clearly sufficient for a subsequent 
vertex extraction procedure [16]. 
4 HARDWARE IMPLEMENTATION

This section shows how the operators used in the algorithm described in
section 2 are implemented in hardware. An estimate of the chip area required in a
3-metal 0.5 µm technology is also given.

The operators described in the following use a faster system clock to reduce
hardware costs by means of 5-stage pipelining.

The first operator is the gradient operator. It is divided into two parts. The first part
is used in the first 4 pipelining stages to calculate the absolute values of the 4
one-dimensional gradient values $g_{px}$, $g_{py}$, $g_{p1}$, and $g_{p2}$. During the last stage, the weighted sum
according to equation (1) is calculated in the second part.




Figure 5 Schematic for determination of parameters 



The implementation uses off-chip RAM for the line memory (see table 1). Only
the registers required for storing the pixels of one window are on the chip. In each
pixel clock cycle, one new window column consisting of 5 pixels is loaded from
the off-chip RAM. The pixels are loaded serially, pixel by pixel, using a system
clock that divides the pixel clock cycle by 5; the number of pins
of the circuit is thus reduced.

The second operator is the local maximum operator. In each stage, 5 pixel values 
are compared to the center pixel and the number of pixel values bigger than the 
center pixel is accumulated across the stages (figure 5). The accumulated value is 
compared to the threshold value in the first stage of the next pixel clock cycle, 
increasing the output image latency by the time corresponding to one pixel clock 
cycle. 

The third operator is pathfinder I. As there are 44 paths, equation (3) (see figure 6)
must be evaluated 9 times in each stage. In the same circuit, the 9 resulting weights
are sorted and the 8 highest weights are stored. In the following pipelining stages,
these 8 weights are sorted together with 9 new weights. In this manner, the 8
highest weights of all 44 paths are available after the last stage and the final
result can be easily obtained.
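
The staged selection described above can be modeled in software as a running "keep the best 8" merge. The Python sketch below is a behavioral model only, not the sorting circuit itself; the function name and the test weights are hypothetical.

```python
import heapq
import math


def top_weights_streaming(weights, per_stage=9, keep=8):
    """Keep the `keep` largest weights while consuming `per_stage` per stage."""
    best = []
    for start in range(0, len(weights), per_stage):
        stage_weights = weights[start:start + per_stage]
        # Merge the stored weights with the new ones and keep the largest.
        best = heapq.nlargest(keep, best + stage_weights)
    return best


# 44 path weights need ceil(44 / 9) = 5 stages of at most 9 evaluations.
stages_needed = math.ceil(44 / 9)

# Hypothetical weights: a fixed permutation of 0..43.
example_weights = [(i * 17) % 44 for i in range(44)]
print(stages_needed, top_weights_streaming(example_weights))
```

Because each stage only ever discards weights that are already outside the best 8 seen so far, the result after the last stage equals the 8 largest of all 44 weights.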




Figure 6 Schematic of equation (3) 

The fourth operator, pathfinder II, works in a similar way to pathfinder I.
However, the number of path weights to calculate and sort is smaller.

Note that all four operators can be implemented on the same chip. Two
external pins can be used to select the output of the chip in order to program the
function that the chip will actually perform. This option allows three
identical circuits with only 16 pins each to be used to implement all the
operations needed for the algorithm.

A transistor count and chip surface estimation is given in table 2. 
Table 2 Transistor count and chip area

operator          number of transistors    chip area (mm²)

gradient                   5323                 0.59
local maximum              1484                 0.16
pathfinder I              27252                 3.03
pathfinder II             15528                 1.73
total                     49587                 5.51



5 CONCLUSION 

The simulation results show that the hardware implementation of the proposed
algorithm is capable of producing good results even for images with poor contrast.
In both cases a continuous, well defined, one pixel wide contour was extracted by
the circuit, which will considerably ease the task of the vertex extraction algorithm,
the last stage of the polygonal modeling operation.

A data-flow solution for the vertex extraction procedure is rather more complicated
than a sequential one. Sequential algorithms progress pixel by pixel along the
shape contour, testing each new pixel in order to detect vertices for the polygonal
approximation. A list of the boundary points has to be available beforehand, which
precludes their use in high-speed real-time operation. On the other hand, data-flow
algorithms work on the image as it is being acquired line by line. Therefore, they
represent the only possible solution if real-time performance is to be achieved.
However, data-flow algorithms look at the image through a rectangular window of
a given size, 5x5 or 8x8, for instance. This makes it difficult to determine to
which shape a contour segment belongs when more than one object (or shape)
is present in the image scene being modeled. A compromise solution to the above
problem may be a hybrid architecture, where vertex detection would be followed
by a non-data-flow procedure in charge of assigning detected vertices to objects.

The development of an algorithm for VLSI implementation of the complete 
polygonal modeling procedure is presently under way. Once it is completed and 
integrated with the circuit described in this paper, a complete high-speed system 
for polygonal modeling of 2D shapes will be available, capable of performing at 
the rates required for real-time applications. 
Abstract

The design of a CMOS piezoresistive pressure sensor with mixed-signal circuitry
to provide a digital output is described. The on-chip sensor is a monolithic silicon
etched diaphragm, built via post-processing, with the strain gauge composed of
diffused resistors in an active Wheatstone bridge configuration. The amplifier’s
bridge is chopped for efficient cancellation of low frequency noise and amplifier
offset. The A-to-D converter is based on a second-order ΣΔ modulator
including 6-bit DAC units for sensor offset calibration. The circuit has
temperature compensation of both full output scale and offset. Overall system
resolution is 12 bits, corresponding to 72 dB dynamic range, while the pressure
range is 0 to 50 kPa. Pulsed mode operation allows for low power dissipation
(<1 mW) at 3 V supply voltage. This device is intended to be used in biomedical
applications such as non-invasive blood pressure measurement and diagnostics.

Keywords 

Pressure sensor interface, post-processing, MEMS, active Wheatstone bridge, ΣΔ
modulator, temperature compensation
1 INTRODUCTION

Pressure sensors continue to be important devices in the industrial and
biomedical fields due to their wide application range. Much work has been done to
fabricate pressure sensors monolithically integrated with the necessary on-chip
electronic circuitry for signal processing, temperature compensation and parameter
calibration, and good results have been presented (Jakob, 1993; J. Dziuban et al.,
1994). Nevertheless, characteristics such as area, power consumption, and
resolution have not been optimised. This work combines two key ideas to develop
a fully integrated pressure sensor. The first is the conception of an architecture
able to attain high resolution, low power consumption and thermal compensation.
The second is to employ post-processing for the final MEMS structure
fabrication. Thus, it is possible to fabricate a highly reliable, robust device based on
standard industrial technologies.

The sensor device consists of a membrane created on a silicon wafer by
anisotropic etching from the backside. The target application is blood pressure
measurement directly in the patient's blood stream. Therefore, the pressure and
temperature ranges are 0 to 50 kPa and 10 to 45°C, respectively, over which the
sensor is calibrated for both offset and sensitivity. The sensor system consists of:

• A chopped active Wheatstone bridge based on a fully differential amplifier
that effectively doubles the sensitivity and allows pulsed sensor operation.

• A 12-bit A/D converter implemented via a second-order ΣΔ modulator.

• Auxiliary DAC units for offset calibration. 

• Digitally programmable compensation of sensor sensitivity (TCS) and offset 
(TCO) dependencies on temperature. 

• Reference voltage and current generators based on a bandgap circuit. 
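
The 12-bit resolution of the converter corresponds to the 72 dB dynamic range quoted in the abstract through the usual 20·log10(2^N) rule. The few lines of Python below simply check that arithmetic; the function name is hypothetical.

```python
import math


def dynamic_range_db(bits):
    """Ratio between full scale and one LSB, expressed in decibels."""
    return 20.0 * math.log10(2 ** bits)


resolution_db = dynamic_range_db(12)
print(f"12 bits -> {resolution_db:.1f} dB")
```

Each additional bit adds about 6 dB, so 12 bits give roughly 72 dB.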



Figure 1 Sensor system architecture. Block labels: offset calibration; 12 bits @
1 kHz output; Vref1(T), Vctrl, Vref3; clock; temperature compensations.

The sensor system architecture is shown in Figure 1. Section 2 describes the
sensor itself as well as its fabrication process; section 3 shows the active bridge 
architecture; section 4 explains the design of the active bridge amplifier; section 5 
shows the A/D converter; and section 6 presents the temperature compensation 
circuits. Finally, some important conclusions are presented in section 7. 



2 SENSOR DEVICE 

The sensor device design and final fabrication (post-processing) were done in our
laboratory. The sensor contains two resistive networks placed on a square
silicon diaphragm for optimal sensitivity. Each resistive branch has two diffused
resistors of 1 kΩ positioned transversally and longitudinally with respect to the
diaphragm edges; the membrane thickness is 10 µm. This kind of microsensor
exploits the silicon piezoresistive effect produced by the membrane deflection as
a consequence of the applied pressure. The optimum resistor placement was
determined using finite element simulations with ANSYS (ANSYS Inc., 1990).

The diaphragm was obtained through anisotropic etching of the silicon bulk. The
KOH solution concentration for the anisotropic etch was 27% at 85°C. First, a
Si3N4 PECVD layer (low temperature process) was deposited as a passivation mask.
After that, the passivation layer window was opened via Reactive Ion Etching
(RIE) of the layer, following a standard photolithography step with an
Electronic Visions optical aligner. The surface containing the CMOS circuitry was
protected against etching of the wire bonding pads by means of a mechanical device. Then,
we made a fast dip in HF solution at 1.0 wt% to strip the native oxide from the
exposed silicon area. Finally, we immersed the wafer in the KOH solution to obtain
the pressure sensor diaphragm on the wafer surface.



3 ACTIVE BRIDGE ARCHITECTURE 

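Referring to Figure 2(a) and assuming $V_{IN}$ and $\varepsilon$ are the voltage applied to the
bridge and the relative branch resistance variation, respectively, the output voltage
is given by:

$V_{out} = \left[\tfrac{1}{2}(1-\varepsilon) - \tfrac{1}{2}(1+\varepsilon)\right] V_{IN} = -\varepsilon \cdot V_{IN}$   (1)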
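
The passive bridge result can be checked numerically. The Python sketch below assumes that each branch is a voltage divider made of one resistor R(1 + ε) and one resistor R(1 − ε), arranged in opposite order in the two branches; this wiring is an assumption, since Figure 2(a) is not reproduced here. The 1 kΩ nominal value is taken from the sensor description and the function name is hypothetical.

```python
def passive_bridge_output(v_in, eps, r_nominal=1000.0):
    """Differential output of a passive bridge with two opposed dividers."""
    r_plus = r_nominal * (1 + eps)    # resistor whose value increases
    r_minus = r_nominal * (1 - eps)   # resistor whose value decreases
    # Midpoint of each branch; the two branches are mirrored.
    v_a = v_in * r_minus / (r_minus + r_plus)
    v_b = v_in * r_plus / (r_plus + r_minus)
    return v_a - v_b


# With a 3 V drive and a 1% resistance variation the output is -eps * V_in.
print(passive_bridge_output(3.0, 0.01))
```

Under the same drive voltage, a bridge with doubled sensitivity would give twice this output magnitude.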

By contrast, the active bridge in Figure 2(b) effectively doubles the sensor
sensitivity because the resistors work as a feedback network around a fully
differential operational amplifier. In this case, the common mode feedback circuit
keeps the output common mode voltage equal to the input reference signal:






( 2 ) 







Part Two Microsystem and Mixed-mode Design 




Figure 2 Passive (a) and active (b) piezoresistive bridges. 

Considering the amplifier inputs are at the V, voltage, the branch currents are: 

= l2 = [Vj(l-z)]. (3) 

Therefore, the output voltages are: 

= -[2/(1 - e)] • V, , V - = -[2/(1 + e)] • V, . (4) 



And, using equations (2) and (4), we can easily derive: 

= = (5) 

For adequate attenuation of both amplifier offset and 1/f noise, the amplifier is 
chopped at 250 kHz, half of the converter clock frequency. The chopping 
frequency is chosen high enough to lie above the 1/f corner frequency (marked in 
Figure 3(a)) while still permitting a low-power amplifier design. Besides doubling 
the sensitivity for the same drive voltage, the active bridge allows pulsed 
operation of the sensor by switching its power supply. This can be exploited to 
save power, since sampling by the following switched capacitor circuit takes only 
a very short time. However, to reduce the overall power dissipation, the duty 
cycle should be chosen carefully. As shown in Figure 3(b), although the resistor 
consumption is proportional to the duty cycle, the amplifier consumption shows 
the opposite tendency because the amplifier must charge and discharge the load 
capacitance. Typical slew rate requirements (50 V/µs) make this power 
dissipation inversely proportional to the pulse duty cycle. An optimum point 
therefore exists, which is characteristic of the amplifier design and has been 
found to be 25%. A general description of the control waveforms is given in Figure 4. 
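
The trade-off described above can be illustrated with a minimal numerical sketch. It assumes the simplest model compatible with the two stated proportionalities: a resistor term proportional to the duty cycle D and an amplifier term inversely proportional to it, P(D) = k_res·D + k_amp/D. The constants `K_RES` and `K_AMP` are hypothetical normalised values, chosen only so that the minimum falls at the reported 25%; they are not taken from the measured design.

```python
import math

def average_power(duty, k_res, k_amp):
    """Simplified average power: resistor term grows with duty, amplifier term shrinks."""
    return k_res * duty + k_amp / duty

def optimum_duty(k_res, k_amp):
    """Duty cycle minimising k_res*D + k_amp/D (obtained by setting dP/dD = 0)."""
    return math.sqrt(k_amp / k_res)

# Hypothetical normalised constants (assumption, not from the paper):
K_RES = 1.0
K_AMP = 0.0625   # chosen so that the optimum lands at 25%

d_opt = optimum_duty(K_RES, K_AMP)
for duty in (0.10, 0.25, 0.50, 1.00):
    print(f"duty={duty:.2f}  relative power={average_power(duty, K_RES, K_AMP):.3f}")
```

In this model the optimum depends only on the ratio k_amp/k_res, which is consistent with the statement that the optimum is characteristic of the amplifier design.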



4 ACTIVE BRIDGE AMPLIFIER DESIGN 

Figure 5 displays the two stages of the bridge amplifier topology. The second 
stage, composed of the driver transistors M11 and M12, furnishes the current for 
the sensor resistive branches, which constitute its actual load, with a resulting gain 
of 28 dB. The first stage is a folded cascode stage, providing an overall gain of 
98 dB at a 3 V supply voltage (see Figure 6). The design assumes a load capacitance 
of 5 pF, taking into account that the diaphragm may be placed far from the amplifier 
although on the same chip. This load determines the transconductance of 
transistors M11-M12, which carry a large sensor bias current (1.25 mA). The 
remaining amplifier power consumption is thus negligible. Using the amplifier 
output stage to provide the sensor current, together with the pulsed operation 
mode, is the key to the excellent power consumption level of the system. 

Figure 3 Amplifier noise (a) and power consumption vs. duty cycle percentage (b). 

Figure 4 Timing of the control signals. 

Figure 5 Schematic diagram of the two-stage fully differential amplifier. 

Figure 6 Bridge amplifier gain magnitude and phase responses. 




Figure 7 Schematic diagram of the common mode feedback circuit. 

Note that, in the common mode feedback circuit of the bridge amplifier shown in 
Figure 7, the reference voltage Vref(T) fixes the drive voltage on the resistors, as 
verified by simulation in Figure 8. The common mode gain-bandwidth product of 
the CMFB circuit is actually larger than the differential one. This is required to 
drive the amplifier properly through Vref(T), a common mode signal, with a very 
short pulse. To do that, the amplifier has an adaptive biasing network 
(Degrauwe et al., 1982) controlling its first stage (devices M15 and M16, 
governed by a signal shown in Figure 5). Also, observe that the pulsed mode 
operation of the bridge amplifier is achieved by including the power-on signals. 



5 ANALOG TO DIGITAL CONVERTER 

We employ a ΔΣ modulator and its associated digital decimation filter to convert 
the sensor output voltage into a digital word. First order modulators require a 
dither signal to reduce unavoidable noise tones to acceptable levels, while 
high-order modulators suffer from potential instability. Hence, a 2nd order 
architecture has become the most attractive option. The performance target in our 
design is 12 bits of resolution, which is equivalent to a 72 dB dynamic range, over 
a 500 Hz baseband. Using a 2nd order ΔΣ modulator, this performance can be 
attained with an oversampling ratio (OSR) of only 50, corresponding to a 50 kHz 
sampling frequency. However, for proper cancellation of the noise in the chopped 
active bridge as well as the correlated noise owing to the double sampled 1st 
integrator, a 500 kHz sampling frequency was chosen instead. This value does not 
unduly increase the overall power dissipation, since the switched capacitor values 
are scaled down by the same factor as the clock frequency increases, resulting in 
similar slew-rate requirements for the amplifiers. A fully differential architecture 
was adopted, as illustrated in Figure 9, following the same active bridge strategy 
and ensuring a high power supply rejection ratio and reduced clock feedthrough 
errors. An additional suppression of the signal-dependent charge injection is 
obtained by using delayed phases for controlling all signal handling switches 
(Boser et al., 1988). 

During phase k1 the 1st integrator input is sampled onto the sampling capacitors, 
simultaneously with the amplifier auto-zeroing. During phase k2, the charge stored 
on the sampling capacitors is transferred to the integrator capacitance. At the same 
time, the reference voltage is either subtracted from or added to the integrator 
depending on the comparator state, by appropriately setting phases x1 and x2. This 
allows a single positive reference voltage to be used. The operation of the 2nd 
integrator is similar, but without auto-zeroing. Besides, because the comparator 
state must be stable during the full sampling/transfer period, an additional latch is 
required, and phases x1, x2 are delayed by half the clock period with respect to 
y1, y2. The sensor offset calibration is accomplished via auxiliary DAC blocks 
coupled with additional switched capacitor branches to the 1st integrator inputs. 

Figure 8 Simulation results of bridge amplifier for a common mode ramp signal. 

Figure 10 shows the diagram of the ΔΣ modulator amplifier. It is a single stage 
mirrored amplifier operating at a low supply voltage with minimum power 
dissipation (Gray et al., 1993). High gain is obtained by using devices with a large 
channel length for the output transistors M5-M8, which are biased at overdrive 
voltages as large as frequency stability allows, while minimising both thermal and 
1/f noise. Moreover, we employ a switched-capacitor configuration for the common 
mode feedback circuit (see Figure 9). Finally, the simulated power spectrum for 
noise and for a 488 Hz sinusoidal input signal is illustrated in Figure 11. Note that 
the 75 dB noise floor meets the dynamic range target. 
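
The oversampling figures quoted in this section can be checked with a short sketch. It uses the 6 dB per bit rule stated in the text and, as an assumption not taken from the paper, the textbook expression for the peak signal-to-quantization-noise ratio of an ideal order-L modulator with white quantization noise. The single-bit quantizer is also an assumption, suggested by the comparator mentioned above; the function names are illustrative only.

```python
import math

def sampling_frequency(bandwidth_hz, osr):
    """Sampling frequency for a given baseband and oversampling ratio (fs = 2 * fB * OSR)."""
    return 2.0 * bandwidth_hz * osr

def ideal_sqnr_db(order, osr, quantizer_bits=1):
    """Textbook peak SQNR of an ideal order-L delta-sigma modulator (white-noise model)."""
    shaping_penalty = 10.0 * math.log10(math.pi ** (2 * order) / (2 * order + 1))
    return (6.02 * quantizer_bits + 1.76 - shaping_penalty
            + (20 * order + 10) * math.log10(osr))

fs_min = sampling_frequency(500.0, 50)   # OSR of 50 over a 500 Hz baseband
target_db = 6.0 * 12                     # 12 bits at 6 dB per bit, as in the text
sqnr_osr50 = ideal_sqnr_db(order=2, osr=50)

print(f"fs at OSR 50: {fs_min / 1e3:.0f} kHz")
print(f"ideal 2nd order SQNR at OSR 50: {sqnr_osr50:.1f} dB (target {target_db:.0f} dB)")
```

Under these idealised assumptions the 2nd order modulator leaves a margin of several dB over the 72 dB target at an OSR of 50, in line with the claim in the text; a real modulator loses part of that margin to thermal noise and circuit non-idealities.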




Figure 10 The ΔΣ modulator's amplifier. 



6 TEMPERATURE COMPENSATION CIRCUITS 

Figure 11 Output spectrum of the ΔΣ modulator without excitation (a), and with a 
sinusoidal input signal (b). 

It is well known that diffused resistors have a temperature dependence that strongly 
affects the sensor sensitivity (TCS). The main factors responsible for this effect are 
the temperature dependence of the resistance itself (TCR) and the temperature 
dependence of the gauge factor (TCGF). Typical TCS values for Wheatstone bridges 
using diffused resistors range from -1100 to -2300 ppm/°C. The most common 
method for temperature compensation in passive bridges is to modulate the drive 
voltage with a complementary temperature dependence. Figure 12 presents the 
block that generates the reference voltages, where the normalised reference voltage 
is given by: 



[Vref {T)/Vref 1 = 0 " {G j N )\±\1 In 8(V^)F • 



( 6 ) 

Figure 12 Diagram of the reference voltage circuit for temperature compensation. 

where G is the gain of the programmable amplifier, and N is the number of the 
tap resistance. We employ the positive temperature dependence in (6) to compensate 
TCS via Vref1(T), and the negative dependence to compensate TCO via the other 
reference voltage. The circuit in Figure 12 can neutralize sensitivity variations at 
different temperatures with 6-bit accuracy (64 steps) through the programmable 
gain amplifier. Likewise, it maintains the same output reference voltage defined at 
ambient temperature, independently of the amplifier gain. 



7 CONCLUSIONS 

A novel architecture, its most important subsystems and characteristics, and an 
approach for the full integration of a piezoresistive pressure sensor have been 
presented. According to our experimental results, it is feasible to build the MEMS 
structure (sensor element and signal processing circuitry) using the 
post-processing method. The membrane was built on a substrate with 
characteristics similar to those of the foundry wafers. The new architecture 
provides higher sensor sensitivity at low power dissipation, delivers a digital 
output through a high performance AD converter, and includes thermal effect 
compensation techniques. These characteristics indicate that it is suitable for 
developing high performance piezoresistive pressure sensors. The device has been 
designed for biomedical applications, specifically blood pressure measurements. 
However, the basic mixed-mode interface may support a wide variety of sensors 
and MEMS devices. 



Abstract 

This paper describes a new technique for on-line testing of analog circuits based 
on plant recognition by adaptive algorithms. To the authors' knowledge, this is the 
first time such a technique has been used for on-line testing of analog circuits, 
allowing complete fault coverage. The paper presents the testing methodology and 
experimental results showing easy detection of soft, large-deviation and hard 
faults with a low cost digital processor. Component variations as low as 10% 
have been detected, as the comparison parameter (output error power) varied from 
300% to 20%. 

1 INTRODUCTION AND MOTIVATION 

Faults in analog circuits can cause different symptoms at the circuit outputs, 
ranging from shorted or open components to a slow deviation of the operating point 
caused by degradation of passive or active component characteristics. On-line 
analog testing should detect any variation in circuit performance and warn a circuit 
supervisor. Different techniques for on-line testing of analog circuits have 
been reported ((Chatterjee, 1993), (Vazquez, 1993), (Lubaszewski, 1995)). Most 
approaches use some sort of analog circuit either to duplicate the analog function 
or to monitor its output. 

In our approach, we observe the output of the analog circuit and compare it with 
an expected output. The comparison, however, is performed in the digital domain, 
with an adaptive filter. The function of the adaptive filter is to duplicate the analog 
behavior of the circuit in the digital domain, and to detect any deviation from the 
operating point in case some fault is present. Any analog function that can be 
represented in the s plane by a pole-zero polynomial fraction N(s)/D(s) can be 
tested using this technique. Since we use digital filters, any other analog function 
expressed in the z-domain as a polynomial fraction N(z)/D(z) can also be tested. 

This paper is organized as follows: section 2 presents the overall idea of our 
approach, with a mathematical explanation of its principles. Section 3 analyses a 
set of examples, where we show the success of the approach by detecting faults 
with some auxiliary digital hardware. In section 4 we discuss some limitations of 
the new technique, followed in section 5 by our conclusions and future work. 



2 THE ADAPTIVE ALGORITHM APPLIED TO TESTING 
2.1 The Adaptive Tester 

In (Ben-Hamida, 1996) and (Mielke, 1996) DSP (Digital Signal Processing) 
techniques were used to detect misbehavior of AD converters. Basically, the 
Fast Fourier Transform was used to identify the presence of harmonics indicating 
signal degradation. In our approach we first recognize the circuit under test as a 
plant, with all its pole-zero characteristics, like an analog signature. Then, we 
apply specific algorithms to detect any deviation from this first obtained analog 
signature. 

The theory of linear plant recognition is well established (Ogata, 1970). 
Recently, with the advent of fast DSP microprocessors and boosted by digital 
communication problems, adaptive filtering has emerged as a good mathematical 
framework to solve not only line equalization problems, but plant recognition as 
well. The idea of analog plant recognition by means of an adaptive algorithm 
is shown in Figure 1. A plant is submitted to an excitation, and the same excitation 
is applied to an adaptive filter. Their outputs are compared, and the amount of 
error helps the adaptive filter change its coefficients in order to track the plant 
performance. 




Figure 1 Basic plant recognition structure 

After some time, all filter coefficients have stabilized, and the plant has a twin 
plant implemented in the filter. Now, one has only to change the output of the error 
signal in order to transform this adapted plant into a tester, as shown in Figure 2. 
Once we substitute the plant by a circuit to be tested, the previously adapted filter 
will output an error signal showing how closely the new plant matches the 
previous (fault-free) one. It is interesting to notice that this explains the robustness 
of the methodology. The basic idea is analogous to the detection of variations in a 
bridge circuit. Any fault in the analog circuit, be it soft, hard or catastrophic, 
will change the plant under test. In other words, any change in the plant signature 
(its pole-zero characteristics) will increase the power of the error signal. The tester 
will detect this change and compare it to a previously defined threshold, set 
according to the tolerated variation of the circuit. 
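
The test decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stored coefficients are frozen, the error power is measured against the observed circuit output, and a fault is flagged when it exceeds the fault-free reference by a chosen factor. The three-tap plant, the perturbed coefficient, the small measurement noise and the ratio `MAX_RATIO` are all hypothetical values introduced for the example.

```python
import numpy as np

def fir_output(coeffs, x):
    """Output of a fixed FIR filter: y[n] = sum_i coeffs[i] * x[n - i]."""
    return np.convolve(x, coeffs)[: len(x)]

def error_power(coeffs, x, observed):
    """Mean power of the difference between the observed output and the filter output."""
    err = observed - fir_output(coeffs, x)
    return float(np.mean(err ** 2))

def is_faulty(coeffs, x, observed, reference_power, max_ratio):
    """Flag a fault when the error power exceeds max_ratio times the fault-free reference."""
    return error_power(coeffs, x, observed) > max_ratio * reference_power

# Hypothetical example data (assumptions, not from the paper).
rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
golden = np.array([0.5, 0.3, 0.1])       # stored fault-free signature
drifted = np.array([0.5, 0.36, 0.1])     # same plant after a parameter drift
noise = 1e-3 * rng.standard_normal(len(x))

good_out = fir_output(golden, x) + noise
bad_out = fir_output(drifted, x) + noise

MAX_RATIO = 10.0
reference_power = error_power(golden, x, good_out)
good_flag = is_faulty(golden, x, good_out, reference_power, MAX_RATIO)
bad_flag = is_faulty(golden, x, bad_out, reference_power, MAX_RATIO)
```

Only one scalar, the error power, has to be compared against the threshold, which is what keeps the decision logic simple.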




It is worth mentioning that not only faults in basic components (resistors and 
capacitors) can be checked, but also faults in the operational amplifiers used. Any 
deviation from the ideal opamp function will be recorded as another pole-zero 
signature in the s or z domain. Any fault present in the opamp will change this 
analog signature and will therefore be detected. 

2.2 The adaptive algorithm and filter topology 

A finite impulse response (FIR) digital filter (Proakis, 1988) can be represented by 
a polynomial in its input samples, each delayed by a fixed number of sampling 
periods: 

$y(n) = a_0\,x(n) + a_1\,x(n-1) + a_2\,x(n-2) + \dots + a_{m-1}\,x(n-m+1)$. (1) 

In equation 1, y(n) is the output at discrete time n, while m is the number of filter 
taps. In an adaptive digital filter all coefficients ($a_0 \dots a_{m-1}$) are determined 
at runtime. The questions to be answered are how to find the correct set of 
coefficients, so that the output of the filter performs the desired function, and how 
to choose the depth m of the filter. In the case of an adaptive filter, the output 
should exactly mimic the output of the plant. Different algorithms are available 
((Haykin, 1991), (Widrow, 1985), (Mulgrew, 1988)), and their basic trade-off is 
the amount of computation that must be performed against the speed of coefficient 
convergence. 

The filter adapts to any plant, provided it has enough taps (coefficient-sample 
pairs) to represent all poles and zeros of the original plant. Actually, the adaptive 
filter matches the impulse response of the plant. To decide which kind of signal 
should be used to excite the filter-plant pair, one must note that this signal: 

1. must last long enough for the filter to converge (the error signal must go 
below a certain threshold), and 

2. must be rich regarding its frequency components, in order to provide 
excitation of the plant at all frequencies of interest. 

In our case, we choose one of the simplest adaptive algorithms, the Least Mean 
Square or LMS algorithm ((Haykin, 1991), (Widrow, 1985), (Mulgrew, 1988)). 
The algorithm is described in Figure 3. 



begin 
   a_i(0) = 0, i = 0...m-1 
   loop 
      y(n) = SUM(a_i(n)*x(n-i)), i = 0...m-1        /* filter */ 
      error(n) = d(n) - y(n)                        /* error */ 
      a_i(n+1) = a_i(n) + μ*error(n)*x(n-i)         /* coefficients */ 
   end 



Figure 3 LMS Algorithm 
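
The loop of Figure 3 translates almost line by line into a short NumPy sketch. This is an illustrative floating-point version under stated assumptions, not the fixed-point C25 code used in the experiments: the function name `lms_identify`, the three-tap example plant `h_true`, the step size and the random seed are all hypothetical.

```python
import numpy as np

def lms_identify(x, d, num_taps, mu):
    """Adapt an FIR filter so that its output tracks d; returns (coefficients, errors)."""
    a = np.zeros(num_taps)         # a_i(0) = 0
    window = np.zeros(num_taps)    # holds [x(n), x(n-1), ..., x(n-m+1)]
    errors = np.empty(len(x))
    for n in range(len(x)):
        window = np.roll(window, 1)          # shift the delay line by one sample
        window[0] = x[n]
        y = a @ window                       # filter
        errors[n] = d[n] - y                 # error
        a = a + mu * errors[n] * window      # coefficients
    return a, errors

# Hypothetical plant used only to exercise the routine (assumption).
rng = np.random.default_rng(0)
h_true = np.array([0.4, 0.25, -0.1])
x = rng.uniform(-1.0, 1.0, 2000)
d = np.convolve(x, h_true)[: len(x)]         # plant output for the same excitation

a_est, errors = lms_identify(x, d, num_taps=3, mu=0.1)
```

With a noise-free plant that is exactly representable by the chosen number of taps, the estimated coefficients approach the plant impulse response and the error decays towards zero.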



Once the plant to be mirrored is known, the sampling frequency should be 
determined by evaluating the fastest plant response to be tracked by the test. 
In other words, the frequency response of the plant must lie below the Nyquist 
limit, fs/2. The maximum sampling frequency is a function of the data acquisition 
system. Since the goal is to test analog circuits, the number of poles and zeros is 
known a priori, so the frequency response of the system is already known. 

The designer can now define the number of taps to be used by the set of relations 
shown in equations 2 and 3. To define the number of taps, one should know the 
sampling frequency (fs) and have an estimate of the duration of the plant impulse 
response (Tir). 

The μ parameter can be chosen between its lower and upper bounds, presented in 
equation 4 ((Haykin, 1991), (Widrow, 1985)), which are related to the power of 
the input signal applied to the system. The choice of μ is a trade-off between 
speed of convergence (larger μ) and precision in the error signal (smaller μ). 

$N_{taps} = T_{ir}/T_s$, (2) 

$T_s = 1/f_s$, (3) 

$0 < \mu < 2 \Big/ \sum_{i=0}^{m-1} x(n-i)\,x(n-i)$. (4) 

In equation 4, x(n) is the input signal sampled at time n, and m is the number of 
filter coefficients. 
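
Equations 2 to 4 can be evaluated with a few lines of code. The sketch below is illustrative only: the impulse response duration of 6 ms and the constant input window are hypothetical inputs, and rounding the tap count up to the next integer is an assumption made here, since the text does not say how a fractional result is handled.

```python
import math
import numpy as np

def number_of_taps(impulse_duration_s, sampling_freq_hz):
    """Equations 2 and 3: Ntaps = Tir / Ts with Ts = 1 / fs, rounded up."""
    # The small guard avoids an extra tap caused by floating-point rounding.
    return math.ceil(impulse_duration_s * sampling_freq_hz - 1e-9)

def mu_upper_bound(window):
    """Equation 4: upper bound on mu for the current window of m input samples."""
    window = np.asarray(window, dtype=float)
    return 2.0 / float(np.dot(window, window))

# Hypothetical figures (assumptions): 6 ms impulse response sampled at 10 kHz.
taps = number_of_taps(6e-3, 10e3)

# Hypothetical window of 60 samples, all equal to 0.5.
mu_max = mu_upper_bound(np.full(taps, 0.5))
print(f"taps = {taps}, mu must stay below {mu_max:.4f}")
```

Because the bound in equation 4 scales inversely with the input power in the window, a larger excitation amplitude or a longer filter both force a smaller step size.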

2.3 Adaptive filtering used for on-line analog testing 

In order to excite the plant conveniently to detect the largest number of faults, the 
input signal should be such that all poles and zeros of the plant are properly 
excited. Any variation in this set of poles and zeros caused by any kind of fault 
will change the basic plant, and then be detected. The best signal would be one 
having all frequencies represented in its spectrum. 

Our first approach was to use the impulse response of the system, for in an 
impulse signal all frequencies are present. The impulse response of a circuit 
applied to circuit diagnosis was partially used in (Su, 1995). The approach of using 
the impulse response of the circuit, however, is quite dangerous. An impulse can 
move some linear circuits out of their linear region of operation. Also, defining the 
correct finite amplitude of a theoretically infinite impulse signal while maintaining 
linearity could be quite tricky. Since the goal is to develop on-line testing, a better 
and more easily generated input signal was needed. 

White noise has, by definition, power equally distributed over its frequency 
spectrum. White noise can be easily generated using a random number generator 
and a DA converter. White noise was therefore the natural second choice. 
However, when used for on-line testing of analog circuits, the white noise 
generator might not be available for economic reasons (a DA converter might not 
be available). In this case, in order to adapt the filter, one might use the same input 
that is applied to the analog circuit during normal operation to run the adaptive 
algorithm. 

In the case of circuits for the audio bandwidth, for example, voice and music 
signals are extremely rich in their frequency spectrum, so that the filter can be 
adapted with enough frequency information. On the other hand, in the case of 
instrumentation signals (for example, strain gauges), there is generally interest in 
a single frequency component. In this case, the on-line filter can adapt to the 
specific frequency, and any deviation will also be detected, as will be shown. 

An AD converter and a small C25 Digital Signal Processor can handle the 
concurrent test, constituting a low cost on-line tester. In order to program the 
tester, the designer can follow two paths. The first one is to define a good plant at 
an abstract level based on his/her design, compute the filter coefficients with any 
mathematical tool (Matlab, Mathcad, etc.) and then use this as the reference 
plant. The other strategy is to develop a fault-free prototype, and then use the 
system to adapt the filter and save the resulting coefficients to be used during 
normal operation and the concurrent test procedure. 

Although all examples in this paper were developed with the C25, simpler 
hardware such as a dedicated bit-serial filter could also be used, lowering HW costs 
in the case of full custom integrated circuit solutions. 



3 CASE STUDIES 

In order to validate the proposed on-line test methodology, two circuits were used. 
The first one is a simple integrator, used to help the explanation of our 
methodology, and was just simulated. The Biquad is a little more complex, and 
was built and tested for different variations in some of its components. 



3.1 Integrator 

The integrator used is shown in Figure 4. The transfer function of this circuit can 
be found from elementary circuit analysis to be: 



$H(s) = \dfrac{-\dfrac{1}{R_1 C}}{s + \dfrac{1}{R_2 C}}$. (5) 



For the simulation of this plant in the digital domain we take its z-transform 
((Ogata, 1970), (Widrow, 1985), (Proakis, 1988)), in series with a zero-order hold: 

$H(z) = \dfrac{z-1}{z}\,\mathcal{Z}\!\left\{\dfrac{H(s)}{s}\right\} = -\dfrac{R_2}{R_1}\cdot\dfrac{1 - e^{-T/(R_2 C)}}{z - e^{-T/(R_2 C)}}$, (6) 



where T is the sampling period. From this equation we have obtained the recursive 
equations for the simulation of the plant, using the values for the circuit 
components presented in Figure 4. 

The system was simulated using Matlab. We used the rand function to 
generate the input to the circuit, and an adaptive FIR filter with 60 taps was 
adapted for 3000 samples. After convergence, the filter coefficients were stored and 
a new plant was defined by changing the value of C from 1.0 µF to 0.95 µF (a -5% 
variation). The resulting output error for the two cases is shown in Figure 5(a) (for 
the identified system) and in Figure 5(b) (-5% variation). The estimated output 
error power ratio was found to be 13.3475 for this case. Both figures are at the 
same scale. 



Figure 4 Simple Integrator, where R1=1e3, R2=1e3, C=1e-6 and T=0.1e-3. 

Figure 5 (a) Error output of a good plant (identified system) and (b) error output 
of the integrator with a variation of 5% in C. 



3.2 Biquad filter 

The Biquad filter was used to validate the methodology, and a prototype circuit 
was built from discrete components. The circuit is shown in Figure 6, with the 
nominal values that were used. 

We used a PC with a DSP board with an analogue interface (16-bit 
resolution, with reconstruction filters) as the input generator, and a C25 board with 
an AD converter (14-bit) as the on-line testing system. The sampling frequency 
used was 8 kHz per channel and we used a 48 taps LMS adaptive filter. The C25 
board sampled the input and the output of the plant. The analogue interface used a 
multiplexed AD without anti-aliasing filters. Data was not sampled simultaneously. 
The experiment block diagram is shown in Figure 7. The output signal was 
generated from the C25 board using an 8-bit DA converter. Note that all 
calculations use fixed point arithmetic (16 bits). 




Figure 6 Biquad schematic, with R1,...,R6=10 kΩ, C1,C2=20 nF 




Figure 7 Experiment block diagram 

We have used 40 taps for the adaptive filter, and adapted the system during 3000 
samples. All data processing after acquisition was made in the C25 board. The 
processor has a 16 bit multiplier, and all filter coefficients were computed with 
fixed point multiplications. 

The output error, the input signal and the filter output were measured for the 
identified system, and are shown in Figure 8(a) (top-down). In this case, the 
system tester was adapted with 4 kHz bandwidth-limited white noise, emulating 
some voice or music information. The error for a +10% variation in C2 is shown in 
Figure 8(b) (both figures at the same scale). 




Figure 8 Error signal (100 mV/div), input signal (200 mV/div) and filter output 
signal (100 mV/div) (a) for the identified system and (b) for a 10% variation in C2 

In Figure 9 (a) we have adapted the tester to a single frequency (0.16kHz), 
emulating the case when the system should be used in some instrumentation 
process. The same variation was applied to the plant (10%) and the results are 
shown in Figure 9 (b). 





Figure 9 Output error (100 mV/div), input signal (200 mV/div) and output signal 
(500 mV/div) (a) for an excitation with a single frequency and (b) for an excitation 
with a single frequency with 10% variation in C2 



In the case of the Biquad, all tests of the proposed methodology detected small 
component deviations. One should also mention that the test is inherently simple 
and fast: there is no need to choose the correct set of input frequencies, nor 
to verify more than one variable such as the gain or phase. Figure 10(a) shows the 
response of the system to a large deviation (50%) of C2, while Figure 10(b) 
presents the response of the tester to a catastrophic fault (C2 is shorted; note the 
filter output). Again, the error signal is clearly different from the normal case, and 
is available in either the digital or the analog domain. Notice the scale difference of 
the error signal between the two pictures (0.1 to 0.5 V/div). 





Figure 10 Output error, input signal and filter output (a) for a 50% deviation in C2 
and (b) for a short in C2. 



4 LIMITATIONS OF THE APPROACH 

Although conceptually simple and easy to use, the methodology is based on some 
assumptions that must be fulfilled. The first is that an AD converter with enough 
resolution and speed is available. The C25 processor can be considered a low 
cost one, but since we are using it as a simple filter, a custom digital filter is 
not out of reach, even with low cost FPGAs. 

Adaptation is quite fast, and even when the filter has a large number of 
coefficients to be settled, this is done only once each time the circuit is powered up. 
Alternatively, the filter coefficients could be stored in an E2PROM so that they are 
always available. 

Since adaptive algorithms are mathematically based on the Z transform, they can 
have a DC component. In our case, however, we still cannot detect gain faults 
separately, since they could be mixed with some low frequency pole or zero 
variations. Moreover, the concurrent testing procedure we use cannot presently 
identify the faulty component. 

Finally, it should be mentioned that for circuits working at very low 
frequencies a huge number of taps would be required. To avoid this, since 
it means extra processing time for the C25 processor, the designer must use a 
lower sampling frequency. Also, at high frequencies, one might have trouble 
with the acquisition system. In this case, the acquisition AD converter must have 
enough bandwidth to respond to all the high frequency poles. 



5 CONCLUSIONS AND FUTURE WORK 

This paper has shown a new on-line test strategy able to precisely detect soft, hard 
or catastrophic faults in an analog circuit. Since we use a digital processor, 
migration from the SW implementation to dedicated HW is at hand. Bit-serial 
digital processors are being investigated in order to reduce filter costs and 
allow fast integration of the tester, even with FPGAs. 

The method is quite robust, and is able to detect minor component variations or 
large plant changes caused by shorts or opens without problems. Since the 
methodology does not assume any particular set of frequencies, it can be applied 
to any analog circuit described as a transfer function in the s-plane or the z-plane. 
Since we use a digital signal processor to implement the filter, its adaptation to any 
analog circuit is straightforward, without any need for redesign. 

In our future work we intend to expand the methodology to include AD and DA 
converters, as well as DC signal components. Moreover, since a saving in 
processing power means a lower tester cost, we are investigating the 
implementation of the digital filter with shifters only, and no multipliers. Also, 
algorithms faster than LMS are to be investigated, as well as ones with a lower 
computational cost. Finally, since the original plant is known by the test engineer, 
the possibility of identifying the specific faulty component by a modification of the 
convergence algorithm will be investigated. 



Abstract 

In this paper a multi-mode signature analyser is proposed to be built into analog 
and mixed-signal integrated circuits. The analyser circuitry is based on analog 
integrators and can perform transient (with one integrator) and frequency (with two 
integrators) response compaction. The area overhead incurred can be very low, since 
the same signature analyser can be shared by several analog macros. Besides that, 
whenever integrators are available on-chip for functional purposes, they can be 
shared with the testing circuitry. Experiments carried out on a Sallen-Key filter 
(transient response compaction) and on a biquad filter (frequency response 
compaction) showed that a very high coverage of catastrophic and parametric 
faults can be obtained by using the proposed multi-mode signature analyser. 

Keywords 

VLSI Integrated Circuit. Test. Analog and Mixed Signal Circuit. BIST. 

1. INTRODUCTION 

With the advances in analog-digital integrated circuits, faster and more complex 
test equipment is required to meet ever more severe test specifications. An 
attractive alternative to simplify the test equipment is to move some or all of the 
tester functions onto the chip itself. The use of Built-In-Self-Test for high volume 
production of mixed signal ICs is desirable to reduce the cost per chip during 
production-time testing by the manufacturers. In addition, it helps to perform 
diagnosis in the field. 

In the past few years, many published papers have been concerned with the 
definition of DFT techniques, but few papers are concerned with the 
implementation of self-test capabilities [Alqutayri, 92-Vazquez, 95]. The 
implementation of self-test capabilities implies the use of either on-chip Test 
Pattern Generators (TPGs) or Output Response Analyzers (ORAs) or both. In the 
case of analog or mixed integrated circuits, two types of Output Response 
Analyzers can be defined: analyzers dedicated to multi-frequency response 
compaction, called multi-frequency ORAs, and analyzers dedicated to transient 
response compaction, called transient ORAs. 

In a mixed circuit, the response compaction can be performed either on a digital 
signal after analog to digital conversion [Nagi, 94] or directly on the analog signal. 
This paper deals with a purely analog module that does not require analog to 
digital conversion. Considering purely analog modules, only one proposal for a 
multi-frequency ORA has been published [Lubaszewski, 96]. In the same way, only 
one proposal for a transient ORA has been published [Renovell, 96]. The objective 
of this paper is to propose a purely analog Multi-Mode Signature Analyzer 
(MMSA) able to perform both multi-frequency and transient output response 
compaction. The ability to perform both multi-frequency and transient response 
compaction on-chip makes it possible to take advantage of the complementarity of 
the two test approaches. 

2. THE MULTI-MODE SIGNATURE ANALYZER 

The analog Multi-Mode Signature Analyzer we propose is based on the use of two 
cascaded follower-integrators which can implement the transparent function or the 
integration function. Each follower-integrator is built from a configurable op-amp 
that has been proposed simultaneously by different authors [Renovell,95 - Vazquez, 
95] . The basic principle of the configurable op-amp is given in figure 1. 





Figure 1 The configurable OPAMP 

Starting from a classical op-amp, a module that can be made fully transparent is
obtained by adding 4 switch transistors; the transparent mode is selected with
Conf=1. Figure 1 gives the symbolic representation of the configurable op-amp with
the additional input Vj and the control input Conf. The configurable op-amp can be
used to build a follower-integrator. The basic principle of the follower-integrator
is given in figure 2 using a switched-capacitor implementation. When the control
input Conf=1, the op-amp bypasses the passive R and C components and operates as a
follower. When Conf=0, the op-amp ignores the additional input and operates as an
integrator.

It is well known that the input offset of operational amplifiers can lead to
output saturation of pure integrators. In case the error induced by the offset cannot
be tolerated, we have chosen to embed autozeroing circuitry [Gregorian, 86] to
cancel the offset. Each integration stage can then be modified according to the
scheme given in figure 2. For simplicity, the module is represented without offset
cancellation in the following sections.




Figure 2 The follower-integrator module 



By cascading two follower-integrators, we obtain the Multi-Mode Signature
Analyzer as illustrated in figure 3. The different modes of operation are controlled
by the vector (Conf1,Conf2). When the configuration vector (Conf1,Conf2)=00 is
used, the Multi-Mode Signature Analyzer performs a double integration and
implements a multi-frequency ORA. When the configuration vector (Conf1,Conf2)
= 01 or 10 is used, the Multi-Mode Signature Analyzer performs a single
integration and implements a transient ORA. The details of these modes are given in
the following sections. Finally, when the configuration vector (Conf1,Conf2)=11
is used, the Multi-Mode Signature Analyzer does not perform any integration and
implements an analog SCAN register controlled by switches S1 and S2.
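
The mode selection described above can be summarized in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the only facts taken from the text are that Conf=1 selects the follower function, Conf=0 the integrator function, and that the number of integrating stages determines the role of the analyzer.

```python
# Illustrative decoding of the configuration vector (Conf1, Conf2).
# Function names are hypothetical; the encoding itself follows the text.

def stage_function(conf):
    """Conf=1 selects the follower (transparent) function, Conf=0 the integrator."""
    if conf not in (0, 1):
        raise ValueError("Conf must be 0 or 1")
    return "follower" if conf == 1 else "integrator"


def analyzer_mode(conf1, conf2):
    """Return the function of each stage and the resulting analyzer role."""
    stages = (stage_function(conf1), stage_function(conf2))
    integrations = stages.count("integrator")
    role = {
        2: "multi-frequency ORA",
        1: "transient ORA",
        0: "analog scan register",
    }[integrations]
    return stages, role


for vector in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(vector, analyzer_mode(*vector))
```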




3. THE MULTI-FREQUENCY RESPONSE ANALYZER 
3.1. Principle and Implementation 



A multi-frequency ORA is obtained if the configuration vector (Conf1,Conf2)=00
is selected in the multi-mode signature analyzer with S1=S2=1. A signature is
obtained by computing the time for the output of the second integrator to reach a
predefined reference voltage (Vref). Assuming $V_{in} = -V_0 \sin(\omega t + \varphi)$
and $V_C|_{t=0} = 0$, it can be shown that

$V_{mid} = -\frac{V_0}{\omega\tau}\,[\cos\varphi - \cos(\omega t + \varphi)]$   (1)

and

$V_{out} = -\frac{V_0}{(\omega\tau)^2}\,[-\sin(\omega t + \varphi) + \omega t \cos\varphi + \sin\varphi]$   (2)

where $\tau = RC = \frac{C}{f_{clk}\,C_{os}}$.

Considering no phase shift in the circuit under test ($\varphi = 0$), (2) becomes:

$V_{out} = -\frac{V_0}{(\omega\tau)^2}\,[\omega t - \sin(\omega t)]$   (3)

Figure 4 shows the integration of a signal with V0=1V and f=500Hz, with a time
constant τ=8·10^-4 s. Due to the initial condition Vc|t=0 = 0, the first integration
results in a signal shifted above analog ground according to (1). Since φ=0, the
second integration gives a monotonically increasing value for Vout.




Figure 4 Double-integration effects for V0=1V, f=0.5kHz and τ=0.8ms.
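
As a rough numerical illustration of the time signature, the sketch below solves the magnitude of equation (3) for the time at which the doubly integrated signal reaches Vref. The amplitude, frequency and time constant are those of Figure 4; the value Vref = 4 V is an assumption borrowed from the experimental section, and the function names and the bisection solver are illustrative choices, not the authors' implementation.

```python
import math


def vout_magnitude(t, v0, f, tau):
    """Magnitude of equation (3): V0/(w*tau)^2 * (w*t - sin(w*t))."""
    w = 2.0 * math.pi * f
    return v0 / (w * tau) ** 2 * (w * t - math.sin(w * t))


def time_signature(v0, f, tau, vref, t_max=1.0, tol=1e-12):
    """Time at which |Vout| reaches vref, found by bisection.

    |Vout| never decreases with t (its derivative is proportional to
    1 - cos(w*t) >= 0), so bisection on [0, t_max] is sufficient.
    """
    if vout_magnitude(t_max, v0, f, tau) < vref:
        raise ValueError("vref is not reached within t_max")
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if vout_magnitude(mid, v0, f, tau) < vref:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Figure 4 parameters; VREF = 4 V is an assumed threshold.
V0, F, TAU, VREF = 1.0, 500.0, 8e-4, 4.0
t_sig = time_signature(V0, F, TAU, VREF)
print(f"time signature: {t_sig * 1e3:.3f} ms")
```

A deviation of the amplitude or frequency of the input changes the returned time, which is what the analyzer exploits as a signature.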

3.2. Fault coverage 

Significant signal deviations generally result in different time signatures.
Nevertheless, for smaller frequencies and larger input voltages, the time signature
may be approximately the same over a significant frequency window, leading to a
limited coverage of parametric (soft) deviations. This loss in fault coverage can be
alleviated by considering larger time constants. However, increasing τ, although it
improves soft fault detection, also increases the testing time (the time required
to reach Vref). It also increases the probability of false rejection. Table 1
summarizes the influence of τ on the valid input space, fault detection, false
rejection and testing time.

Table 1 The effects of τ changes.
Considering two signals f1(t) and f2(t) of the valid input space, signature aliasing
occurs when f1(t) and f2(t) result in approximately the same time signature:

$V_{out,1}(t_1) = V_{out,2}(t_2) = V_{ref}$, with $t_1 \approx t_2$   (4)

From equation (3) and neglecting the sin(ωt) terms, equation (4) implies:

$\frac{V_1}{\omega_1} \approx \frac{V_2}{\omega_2} \;\Rightarrow\; \frac{V_1}{f_1} \approx \frac{V_2}{f_2}$   (5)

Considering a linear aliasing behaviour as in equation (5), the aliasing can be
illustrated by means of figure 5. Given a signal (V,f) in the input space, this signal
is accepted to deviate in voltage/frequency by an amount dV/df with respect to the
nominal value. Considering a frequency deviation as in the case of signal (V,f) in
figure 5, all signals within the highlighted cone should then leave the same
signature. A deviation in voltage is illustrated for signal (V',f) in figure 5.
Depending on the location of the signal in the input space, the aliasing cone will
be larger for one type of deviation than the other.

Given a correct signal (V,f) with its corresponding signature, the probability
that a faulty signal gives the same signature is the probability of the faulty signal
falling within the aliasing cones. Considering that the faulty signal has the same
probability of being anywhere in the input space, this can be calculated as follows.




Figure 5 Aliasing probability computation. 



In the case of the frequency deviation cone for (V,f), the probability of aliasing is
given by:

In the case of the voltage deviation cone for (V',f), the probability of aliasing
gives:

Depending on the location of the signal, the size of its aliasing cone will be
different, but figure 5 illustrates the worst cases for frequency and voltage
deviations. For any signal, the worst case aliasing probability is then given by:

$P_{aliasing,max} = \max(df, dV)$   (6)



Then, assuming that deviations of ±5% are accepted for the nominal frequency and
nominal voltage, the probability of aliasing in the worst case given by equation (6)
will be 0.05. Finally, let us consider faults in the signature analyzer structure. In
order to analyse the mutual influence of t, ω and τ, let us isolate t in equation (3)
for Vout=Vref and differentiate the resulting expression with respect to τ. Then,
neglecting the term sin(ωt), we obtain:

$\frac{\partial t}{\partial \tau} = 2\,\frac{V_{ref}}{V_0}\,\omega\tau$   (7)

According to (7), the time signature t is more sensitive to τ variations for large
values of ω (which justifies neglecting sin(ωt)) and large values of τ. Then, in order
to ensure a good coverage of the signature analyzer faults, very low frequencies
should be avoided.
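
For readers who want the intermediate steps behind equation (7), a short derivation is sketched below. It works with magnitudes (the sign in equation (3) only reflects the polarity of the integrators) and makes the same approximation as the text, namely that the sin(ωt) term is negligible.

```latex
% Magnitude of (3) with the sine term neglected:
|V_{out}(t)| \approx \frac{V_0}{(\omega\tau)^2}\,\omega t = \frac{V_0\,t}{\omega\tau^2}

% Setting |V_{out}(t)| = V_{ref} and isolating t:
t \approx \frac{V_{ref}}{V_0}\,\omega\tau^2

% Differentiating with respect to \tau gives (7):
\frac{\partial t}{\partial \tau} \approx 2\,\frac{V_{ref}}{V_0}\,\omega\tau
```

The sensitivity grows linearly with both ω and τ, which is why very low test frequencies give poor coverage of faults in the analyzer itself.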



3.3. Experimental Results

A discrete version of the switched-capacitor multi-frequency signature analyser was
used for testing the biquad filter of figure 6. The signature analyser was
implemented on a protoboard with Cos=2.2nF and C=22nF. The power supplies were
set at +5V and -5V.




Two frequencies were applied to the circuit under test, f1=721Hz and f2=1442Hz.
The fault set considered includes large deviations (-50% and +100%) of passive
components. In order to reproduce in practice the simulation conditions of figure 4,
signature analysis starts when, following a falling zero crossing of the test signal,
the precomputed delay for the fault-free signal to reach analog ground elapses. The
consideration of this delay is not required in normal test application. The signature
computation ends when the analyser comparator signals that a threshold voltage
Vref=+4V was reached. Then the integration capacitors (C) are reset (Init=1 in both
integration stages), the signature analyser input is disconnected from the filter
output, and the analyser input is grounded. This procedure is performed cyclically in
order to provide the oscilloscope with periodic waveforms. Figures 7(a and b) show,
from top to bottom, the signals at the input of the analyser (Vin) and at the outputs
of the first (Vmid) and the second integrator (Vout), obtained for the fault-free
biquad at 721Hz and 1442Hz respectively.

The experiments with the faulty filter consisted firstly in injecting component
deviations of +100% and -50% (2·Ci, Ci/2, 2·Ri, and Ri/2) into the circuit and in
computing time signatures. A 100% coverage is achieved for these passive
component faults. The results obtained for the faults R6/2 and C1/2 are presented in
figures 7(c and d). Fault R6/2 is detected at 721Hz and remains undetected at
1442Hz; fault C1/2 is detected at 1442Hz and remains undetected at 721Hz. From
the analysis of the signals Vin (top), Vmid (middle) and Vout (bottom), it is clear
that different signatures are obtained for phase and amplitude deviations.

Secondly, although the precomputed frequencies originally did not consider the
detection of small component deviations, deviations of ±20% from the nominal
values of resistors and capacitors were considered in our experiments. The fault
coverage decreased slightly, especially due to components to which the filter output
voltage is less sensitive for the frequencies applied. This is the case of capacitor
C1. If other nodes are observed and/or other test frequencies are applied, better
sensitivity figures may be obtained. Finally, all faults injected into the signature
analyser (switch shorts and opens, and capacitor deviations) were also detected.




(c) faulty biquad (R6/2) at 721Hz (d) faulty biquad (C1/2) at 1442Hz



Figure 7 Filter test results: Vin (top), Vmid (middle) and Vout (bottom).

4. THE TRANSIENT RESPONSE ANALYZER
4.1. Principle and Implementation 

A transient ORA is obtained if the configuration vector (Conf1,Conf2)= 01 or 10 is
selected in the multi-mode signature analyzer. When the Multi-Mode Signature
Analyzer is configured with vector 01, the first stage corresponds to an integrator
and the second one to a follower. The integrator implements the transient response
compaction function; the follower is not absolutely required. However, the follower
can be used after compaction to memorize the Analog Signature. The two switches
in figure 3 controlled by S1 and Init are used for initialization and integration time
control purposes.

The proposed transient ORA can be considered as an analog counterpart to the
cumulative addition technique used by digital signature analyzers [Rajski, 92]. In
integration mode (S1=1 and Init=0), the integration function is performed with a
time constant RC. If the integration mode is held for a time interval [0; Tj], the
output value at the end of the integration time, Vout(Tj), is equal to:

$DS = O_1 + O_2 + \dots + O_n \;\Rightarrow\; AS = V_{out}(T_j) = -\frac{1}{RC}\int_0^{T_j} V_{in}(t)\,dt$
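
The analog signature can be approximated numerically from a sampled response. The sketch below is a minimal illustration, not the circuit itself: it integrates a synthetic first-order step response with the trapezoidal rule, and the waveform, the sampling step and the component values are all assumptions chosen for the example.

```python
import math


def analog_signature(samples, dt, rc):
    """AS = -(1/RC) * integral of Vin over the window (trapezoidal rule)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    area = sum(0.5 * (a + b) * dt for a, b in zip(samples, samples[1:]))
    return -area / rc


# Synthetic response (assumption): Vin(t) = 1 - exp(-t/T0) over [0, T_WINDOW].
T0 = 1e-3          # response time constant, seconds
T_WINDOW = 5e-3    # integration window, seconds
RC = 1e-3          # integrator time constant, seconds
DT = 1e-6          # sampling step, seconds

n_steps = round(T_WINDOW / DT)
response = [1.0 - math.exp(-k * DT / T0) for k in range(n_steps + 1)]
signature = analog_signature(response, DT, RC)
print(f"analog signature: {signature:.6f} V")
```

For this waveform the integral has the closed form T_WINDOW - T0·(1 - exp(-T_WINDOW/T0)), which gives a convenient check of the numerical result.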



4.2. Fault Coverage 



To validate the implementation of the Multi-Mode Signature Analyzer in transient
response compaction mode, the second order Sallen-Key filter of figure 8 has been
studied. The validation is performed by SPICE simulations using a filter
implementation with passive components in order to simulate hard faults on the
passive components. Using a catastrophic fault model including 4 faults per MOS
transistor (gate-drain and gate-source shorts, opens on drain and source contacts)
together with 2 faults per passive component (short and open on resistances and
capacitances), we determine the fault coverage of the filter. This fault coverage is
calculated by processing either the transient response O(t) during the test interval, or
the analog signature AS delivered by the analyzer at the end of the test time. The
input test stimulus is a pulse signal. This type of test stimulus has been identified
as very efficient and is widely used for test purposes [Evans,90 - Alqutayri,92].

Figure 8 Second Order Sallen-Key Filter



Under these conditions, using a classical tolerance of 10% around the nominal
response of the filter, 41 of 48 faults are detected when processing the transient
response. This off-chip time-domain analysis gives a fault coverage FCref equal
to: FCref = 85.4%.

In a second step, the fault coverage is established using the analog signature
analyzer. The analyzer input is directly connected to the output node of the filter.
Fault detection is then determined by a single comparison between the signature
generated for faulty circuits and the pre-determined fault-free one, taking into
account the same tolerance of 10%. Simulation results demonstrate that 40 of 48
faults are detected, which corresponds to a fault coverage FCAS equal to: FCAS =
83.3%. Comparison between time-domain and signature analysis results shows
that among the 41 faults normally detected when processing the transient response
O(t), 40 faults are detected when observing the analog signature AS. So, for this
example, the aliasing problem occurs for only one fault of the list. This result
points out that the continuous integration function can be considered as an efficient
analog compression technique and that the proposed implementation of the analyzer
works correctly.
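
The coverage figures quoted above follow directly from the fault counts. The sketch below reproduces the arithmetic and shows one plausible reading of the 10% tolerance test; the function names are hypothetical, and the exact comparison rule used in the simulations is an assumption.

```python
def fault_coverage(detected, total):
    """Fault coverage in percent."""
    return 100.0 * detected / total


def is_detected(signature, nominal, tolerance=0.10):
    """Assumed rule: a fault is detected when the signature leaves the
    +/-10% band around the fault-free (nominal) signature."""
    return abs(signature - nominal) > tolerance * abs(nominal)


TOTAL_FAULTS = 48
fc_ref = fault_coverage(41, TOTAL_FAULTS)   # off-chip time-domain analysis
fc_as = fault_coverage(40, TOTAL_FAULTS)    # on-chip analog signature
aliased = 41 - 40                           # faults masked by compaction
print(f"FCref = {fc_ref:.1f}%  FCAS = {fc_as:.1f}%  aliased faults = {aliased}")
```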

4.3. Improvement 

On the basis of the analog signature analyzer presented above, we define a
multiple-input module. Indeed, the analyzer is based on the classical opamp
integrator and this integrator exists in both single and multiple-input versions.
So, it is easy to define a multiple-input signature analyzer implementation, as
described in figure 9. During the integration mode (S1=1 and Init=0), all the input
signals Vin1, Vin2, ..., Vinn are integrated and the output signal Vout(t) is
expressed by the following equation:

$AS = V_{out}(T_j) = -\sum_i \frac{1}{R_i C}\int_0^{T_j} V_{in,i}(t)\,dt$

The area overhead induced by the replacement of a single-input analyzer by a
multiple-input one is very low: it is sufficient to add one resistor and one switch
per additional input, a single opamp and a single capacitor being required to perform
the integration.
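
The following sketch illustrates why an extra input can help. It uses purely synthetic waveforms (an assumption, not simulation data from the paper): the faulty output has the same integral as the fault-free one, so a single-input signature aliases, whereas adding a second, internal observation point separates the two signatures.

```python
def trapezoid_area(samples, dt):
    """Integral of a sampled waveform by the trapezoidal rule."""
    return sum(0.5 * (a + b) * dt for a, b in zip(samples, samples[1:]))


def multi_input_signature(inputs, resistors, c, dt):
    """AS = -sum_i (1/(R_i*C)) * integral of Vin_i over the window."""
    return -sum(trapezoid_area(v, dt) / (r * c) for v, r in zip(inputs, resistors))


DT = 1e-3                 # sampling step, seconds (illustrative)
R, C = 10e3, 100e-9       # illustrative values, R*C = 1 ms

# Synthetic waveforms: the output node has the same area in both cases.
out_good = [1.0, 1.0, 1.0, 1.0, 1.0]
out_faulty = [1.5, 1.5, 1.0, 0.5, 0.5]
# Hypothetical internal node that does react to the fault.
node_good = [0.2] * 5
node_faulty = [0.6] * 5

single_good = multi_input_signature([out_good], [R], C, DT)
single_faulty = multi_input_signature([out_faulty], [R], C, DT)
double_good = multi_input_signature([out_good, node_good], [R, R], C, DT)
double_faulty = multi_input_signature([out_faulty, node_faulty], [R, R], C, DT)

print("single input:", single_good, single_faulty)   # aliasing
print("two inputs:  ", double_good, double_faulty)   # separated
```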




Figure 9 Multiple-Input Analog Signature Analyzer



This multiple-input signature analyzer can be considered as the analog counterpart
to the digital Multiple-Input Shift Register (MISR). Analog circuits often present
few output pins, and frequently a single output pin (for example when dedicated to
filtering applications). Considering a single output circuit, the interest of the
Multiple-Input Analyzer consists in the observation of additional internal nodes.
In fact, in this situation the multiple-input analyzer adds some extra observation
points in the circuit, thus increasing the global observability. Observability is
commonly accepted as one of the most important attributes of testability, and
so, by increasing the observability, we make the circuit more testable.

To illustrate this idea, we study the fault coverage achieved for the previous
Sallen-Key filter using a 2-input signature analyzer. One input of the module is
used to process the filter output node and the other constitutes the additional
observation point. Obviously, this point has to be judiciously chosen to be able to
get an increase in fault coverage. Starting with the remark that all the non-detected
faults are faults inside the opamp, the additional observation point is chosen as an
internal node of the opamp. So, the two analyzer inputs are connected to (i) the
output node of the filter and (ii) the output node of the opamp differential stage.
The SPICE simulations show that the new fault coverage FCAS* is equal to:

FCAS* = 97.9%.

Keeping in mind that the reference fault coverage achieved by an off-chip
time-domain analysis is equal to FCref=85.4%, we note a very significant increase when
using the multiple-input analyzer. It has to be emphasized that usually, when
using signature analysis techniques, the problem is to minimize the number of
error masking phenomena in order to obtain a fault coverage as close as possible to
the fault coverage achieved by an external test. With the multiple-input analog
analyzer, we can not only partially overcome aliasing problems by introducing
redundancy, but also improve the fault coverage. This
solution consequently goes beyond the classical objectives of signature analysis
techniques.

5. CONCLUSIONS 

As in the digital world, in the analog domain the off-chip analysis of output
test responses requires test equipment capable of extracting a high volume of data
from the circuit under test and capable of properly processing these data. This kind
of equipment is in general very expensive and takes a relatively long time to extract
and process the test data. Those costs can be drastically reduced if a built-in output
response analyser is used to perform test data compaction on-chip. Then a simple
comparison of the computed analog signature to a reference value can provide a
go/no-go indication outside the chip.

In this paper a multi-mode signature analyser is proposed to be built into analog
and mixed-signal integrated circuits. The analyser circuitry is based on analog
integrators and can perform transient (with one integrator) and frequency (with two
integrators) response compaction. The area overhead incurred can be very low, since
the same signature analyser can be shared by several analog macros. Besides that,
whenever integrators are available on-chip for functional purposes, they can be
shared with the testing circuitry.

Experiments carried out on a Sallen-Key filter (transient response compaction) and on a
biquad filter (frequency response compaction) showed that a very high coverage of
catastrophic and parametric faults can be obtained by using the proposed
multi-mode signature analyser. It was also pointed out that existing aliasing problems can
be alleviated by enhancing the observability of the circuit under test using the same
signature analyser. In addition, the effects of the operational amplifier offset on the
integration stages were shown to be cancelled by autozeroing circuitry.

Given the results above, one can conclude that the testability of analog and
mixed-signal circuits can be greatly improved by using the multi-mode signature analyser
proposed in this paper. This improvement can be achieved at a rather low cost when
compared to the price of commercially available test equipment.
Abstract

In this contribution, the design process and implementation of a single-chip timing
and carrier synchronizer and channel decoder for digital video broadcasting over
satellite (DVB-S) are described. The device consists of an A-to-D converter with
AGC, a timing and carrier synchronizer including a matched filter, a Viterbi decoder
including node synchronization, a byte and frame synchronizer, a convolutional
de-interleaver, a Reed-Solomon decoder, and a descrambler. The system was designed
in accordance with the DVB specifications. It is able to perform Viterbi
decoding at data rates up to 56 Mbit/s and to sample the analog input values
at up to 88 MHz. The chip allows automatic acquisition of the convolutional
code rate and the position of the puncturing mask. The synchronization to the
variable sample rates is performed fully digitally by means of interpolation and
controlled decimation. Hence, no external analog clock recovery circuit is needed.
For algorithm design, system performance evaluation, and co-verification of
the building blocks, an advanced design methodology was used. This guarantees
both short design time and high reliability. The chip has been fabricated in a 0.5
μm CMOS technology with three metal layers. A die photograph is presented.
1 INTRODUCTION

A modulation and channel decoding system for digital multi-program television
broadcasting is standardized in the digital video broadcasting (DVB) standard
(European Telecommunications Institute 1994). The satellite system is intended
to provide direct-to-home services for consumer integrated receiver decoders
(set-top boxes), as well as cable television head-end stations. For the consumer
market, inexpensive and reliable implementation solutions are required.
Therefore, the goal of the system designer is to implement as many functions as
possible on a single chip and to avoid the use of unreliable and expensive analog
components. This has been taken into account during the development of the
system which is presented here.




Figure 1 Block diagram of the system 

A block diagram of the system is shown in Figure 1. The received LNB-output
signal is prefiltered and down-converted to the second IF (480 MHz). After filtering
with a SAW filter, this signal is fed into the demodulator. The demodulated
analog I and Q signals are A-to-D converted on chip. Within the timing
synchronizer, timing offset correction and adaptation of the sample rate to the (varying)
symbol rate are performed. Carrier phase and frequency offsets are compensated
for in the carrier synchronizer unit. The output of the matched filter is fed to
the de-puncturing unit of the Viterbi decoder, which is controlled by a node
synchronization unit. After byte and MPEG transport multiplex packet
synchronization, the de-interleaved byte stream is fed into the Reed-Solomon (RS)
decoder. The decoded information bytes are de-scrambled and output to the
MPEG2 decoder.

For an inner code rate of 1/2 and an Eb/N0 of 4.2 dB, a BER of 2×10^-4
(behind the Viterbi decoder) is specified in the standard. This corresponds to
quasi-error-free operation (BER of 10^-10 to 10^-11) behind RS-decoding.

Loop parameters, acquisition and tracking performances of all synchronizing
units, and even acquisition strategies are configurable via a standardized IIC-bus
interface. In addition, internal states and important system information can be
read out.



2 SYNCHRONIZATION OF QPSK SIGNALS 

The synchronization of PAM signals is a topic well explored and known to
the research community. An overview of the subject is given e.g. in (Meyr,
Moeneclaey & Fechtel 1997).

In this project, we used two feedback loops for timing and carrier
synchronization, which were completely separated. This approach leads to a simpler
acquisition strategy, eases the specification of algorithmic parameter ranges and
quantization, and increases design robustness. The timing synchronization error
feedback loop precedes the carrier synchronization loop in which the matched
filter is embedded. Timing and carrier synchronization are described in more
detail in the next two sections.



2.1 Timing Synchronization 

Timing synchronization for continuous data streams is performed by a feedback
loop consisting of a timing error detector, a loop filter and either a controllable
VCO or a digital interpolator. The latter solution has several advantages: it
minimizes the interaction between analog and digital circuitry (and
hence reduces design time and test complexity); it allows the use of cheaper analog
components; and its design is easier to handle, as no joint analog and digital
modeling technique has to be employed.
Figure 2 Block diagram of the timing and carrier recovery

In order to achieve a virtually carrier independent timing acquisition, the
Gardner Timing Error Detector (Gardner 1986), which is known to produce an
error estimate that will lead to timing estimates approaching the Cramer-Rao
bound (CRB) (Oerder & Meyr 1987), is used in conjunction with a sampling-rate
conversion NCO and a digital interpolator. A more detailed discussion of
the developed algorithm for timing synchronization of variable sample rates can
be found in (Lambrette, Langhammer & Meyr 1996, Lambrette, Langhammer
& Meyr 1997).
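
As an illustration of the detector named above, the sketch below implements the textbook form of the Gardner timing error detector for a complex signal sampled at two samples per symbol. It is not the chip's implementation: the test signal (alternating symbols with a cosine transition), the function names and the absence of any loop filter or interpolator are simplifying assumptions.

```python
import cmath
import math


def gardner_errors(samples):
    """Gardner timing error for a stream at 2 samples per symbol.

    Even indices are taken as symbol strobes, odd indices as the midpoints
    between them: e(k) = Re{ [y(k) - y(k-1)] * conj(y(k-1/2)) }.
    """
    errors = []
    for n in range(2, len(samples), 2):
        difference = samples[n] - samples[n - 2]
        midpoint = samples[n - 1]
        errors.append((difference * midpoint.conjugate()).real)
    return errors


def alternating_test_signal(timing_offset, carrier_phase=0.0, symbols=8):
    """Synthetic +1/-1 symbol stream, sampled timing_offset symbols late."""
    rotation = cmath.exp(1j * carrier_phase)
    return [rotation * math.cos(math.pi * (n / 2 + timing_offset))
            for n in range(2 * symbols + 1)]


for offset in (-0.1, 0.0, 0.1):
    errs = gardner_errors(alternating_test_signal(offset))
    print(f"offset {offset:+.1f} -> mean error {sum(errs) / len(errs):+.4f}")
```

Because the difference and the midpoint are rotated by the same carrier phase, the product with the conjugate is unchanged by a common rotation, which is the carrier independence the text refers to.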

The structure of the timing synchronizer loop is depicted in the left-hand
side of Figure 2. After interpolation and consecutive decimation, a loop-in-lock
criterion is computed within the blocks 'power estimation' and 'lock detection'.
This in-lock criterion is evaluated in the acquisition control unit. Steered by
the acquisition control unit, the loop filter processes the output samples of the
timing error detector. The output of the loop filter is connected to a numerically
controlled oscillator (NCO) that provides the interpolator filters with filter
coefficients and controls the decimation.




The interpolator consists of two independently operating FIR filters with
variable coefficients. These filters are implemented according to the modified
bit-plane approach (Noll 1987) which yields a small silicon real estate in conjunction
with a high sample rate. The number of full adder cells between two consecutive
pipeline register cells has been chosen to be 2 in order to gain the smallest
possible area while fulfilling the constraints on the data rate. This results in a
modified structure (Vaupel & Meyr 1994) compared to (Noll 1987). The principle
of the structure is depicted in Figure 3, exemplified by a filter with three
taps and a coefficient word length of four bits. By re-ordering the add
operations, the partial products with the smallest possible values are added up
first, leading to a smaller word length of the intermediate results. In order to
increase efficiency, modified Booth encoding of the coefficients is applied.
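
Modified Booth encoding recodes a two's complement coefficient into radix-4 digits in the set {-2, -1, 0, 1, 2}, which halves the number of partial products. The sketch below shows the standard recoding rule for a small word length; the function name is hypothetical and the code says nothing about how the recoded digits are mapped onto the bit-plane hardware.

```python
def booth_radix4_digits(value, width):
    """Radix-4 (modified) Booth digits of a two's complement number.

    Returns width/2 digits, least significant first, each in {-2,...,2},
    such that value == sum(d * 4**i for i, d in enumerate(digits)).
    """
    if width % 2:
        raise ValueError("width must be even")
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    # Python's >> is arithmetic, so this yields the two's complement bits.
    bits = [(value >> i) & 1 for i in range(width)]
    digits = []
    previous = 0  # implicit bit to the right of the LSB
    for i in range(0, width, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + previous)
        previous = bits[i + 1]
    return digits


# Four-bit coefficients, as in the three-tap example mentioned in the text.
for coefficient in (-8, -3, 5, 7):
    print(coefficient, booth_radix4_digits(coefficient, 4))
```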

2.2 Carrier Recovery 

Carrier recovery (Meyr et al. 1997) is based on an NDA phase error detector
that feeds a second order loop filter whose output is then passed to a phase
rotator. Carrier recovery itself is performed at symbol rate, whereas carrier frequency
and phase correction is carried out before the matched filter and hence runs at
the sample rate of 2/T.

The structural block diagram of the carrier synchronizer can be seen in the
right-hand side of Figure 2. The output samples of the interpolator are rotated
in a CORDIC (Volder 1959, Dawid & Meyr 1996) processor and subsequently
filtered in a matched filter, an FIR filter with fixed coefficients. Since the Viterbi
decoder requires a sign-magnitude representation at the input, the two's
complement encoded samples are converted in the scaling block. Additionally, the
output samples of the matched filter are scaled according to the input
requirements of the phase error detector and the carrier lock detection unit. Parameters
of the carrier loop filter are mainly set by the carrier acquisition control unit.
The output of the loop filter is accumulated in a numerically controlled oscillator
that provides the CORDIC processor with the rotation angle.
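
The rotation performed by a CORDIC processor can be sketched in floating point as follows. This is the generic rotation-mode algorithm, shown only to make the principle concrete: the number of iterations, the floating-point arithmetic and the explicit gain compensation are assumptions, and a hardware implementation would use fixed-point shifts and adds instead.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by angle (radians, roughly |angle| < 1.74) using
    rotation-mode CORDIC micro-rotations by +/- atan(2**-i)."""
    gain = 1.0
    residual = angle
    for i in range(iterations):
        step = 2.0 ** -i
        direction = 1.0 if residual >= 0.0 else -1.0
        x, y = x - direction * y * step, y + direction * x * step
        residual -= direction * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    # Each micro-rotation stretches the vector; divide the total gain out.
    return x / gain, y / gain


# Rotate the sample (1, 0) by an illustrative angle, as the NCO output would.
angle = 0.5
xr, yr = cordic_rotate(1.0, 0.0, angle)
print(xr, yr, math.cos(angle), math.sin(angle))
```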




Figure 4 Block diagram of one branch of the matched filter 

The complex-valued matched filter is implemented as two identical but
independent FIR filters with fixed coefficients which are encoded in canonical
signed digit (CSD) format in order to increase efficiency. Exploiting a carry-save
representation as internal data format, the filters are implemented as rows of
adder cells (bitplanes). Since investigations led to an optimum pipeline depth
(the number of additions between two registers) of three, a re-ordering of the
bitplanes similar to (Vaupel & Meyr 1994) has been applied to reduce silicon
real estate. In order to provide the adder cells with the correctly delayed values,
the input samples are delayed in one shift register chain. Figure 4 shows the
structural principle.
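
Canonical signed digit encoding represents a coefficient with digits -1, 0 and +1 such that no two adjacent digits are non-zero, which minimizes the number of adders needed for a fixed coefficient. The sketch below uses the standard non-adjacent-form recoding; the function name and the example coefficients are illustrative and do not come from the chip.

```python
def to_csd(value):
    """Canonical signed digit form of an integer, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    n = value
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)   # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits or [0]


for coefficient in (7, 23, -3):
    csd = to_csd(coefficient)
    nonzero = sum(1 for d in csd if d)
    print(coefficient, csd, "non-zero digits:", nonzero)
```

For the coefficient 7, for example, the binary form needs three non-zero bits, whereas the CSD form 8 - 1 needs only two.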



3 VITERBI DECODER 

The Viterbi decoder operates on all DVB compliant code rates (1/2, 2/3, 3/4,
5/6, and 7/8) by means of de-puncturing. It consists of the Viterbi core, a
de-puncturing unit, an error correction rate (ECR) measurement unit, and a
synchronization controller. The basic Viterbi decoder core consists of a transition
metric unit (TMU), an add compare select unit (ACSU) and a survivor memory
unit (SMU) with an implemented survivor depth of 128. The de-puncturing
unit steers the input FIFO to convert the data rates according to the code rates
and performs the actual de-puncturing according to the current synchronization
state. It is able to perform a 90 degree rotation of the received QPSK symbol
prior to the actual de-puncturing for synchronization purposes. Since up to 4
QPSK symbols belong to one de-puncturing period (for code rate 7/8), an offset
is input to the unit to be able to adjust the de-puncturing sequence to possible
offsets of the received sequence. The error correction rate (ECR) of the Viterbi
decoder, i.e. the rate of differing bits between the hard decisions and the re-encoded
data stream, is measured. This rate is an estimate of the hard error rate of the
channel and can thus be used to estimate the channel SNR. The synchronization
controller performs node synchronization automatically, based on a choice of
programmable code rates and thresholds on the correction rate which indicate
out-of-sync conditions.

Figure 5 Structural outline of the Viterbi decoder
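
De-puncturing re-inserts neutral values (erasures) at the positions that the transmitter removed, so that the rate-1/2 Viterbi core can be used for all code rates. The sketch below is a software illustration only. The puncturing patterns are quoted from memory of the DVB-S specification and should be treated as assumptions to be checked against the standard; the function name, the column-based `offset` parameter and the use of `None` as erasure are likewise hypothetical.

```python
# Assumed puncturing patterns (1 = transmitted, 0 = punctured) for the two
# encoder outputs X and Y; verify against the DVB-S specification.
PATTERNS = {
    "1/2": ("1", "1"),
    "2/3": ("10", "11"),
    "3/4": ("101", "110"),
    "5/6": ("10101", "11010"),
    "7/8": ("1000101", "1111010"),
}


def depuncture(received, rate, offset=0, erasure=None):
    """Rebuild (X, Y) pairs from the received soft values, inserting
    `erasure` wherever the pattern says a bit was punctured."""
    pattern_x, pattern_y = PATTERNS[rate]
    period = len(pattern_x)
    column = offset % period
    stream = iter(received)
    pairs = []
    try:
        while True:
            x = next(stream) if pattern_x[column] == "1" else erasure
            y = next(stream) if pattern_y[column] == "1" else erasure
            pairs.append((x, y))
            column = (column + 1) % period
    except StopIteration:
        pass  # input exhausted; an incomplete last pair is dropped
    return pairs


print(depuncture([1, 2, 3, 4], "3/4"))
for rate, (px, py) in PATTERNS.items():
    symbols = (px.count("1") + py.count("1")) / 2
    print(rate, "QPSK symbols per period:", symbols)
```

With these patterns, rate 7/8 transmits eight bits per period, i.e. four QPSK symbols, which matches the figure given in the text.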



4 FRAME SYNCHRONIZATION AND CONVOLUTIONAL 
DE-INTERLEAVING 

The frame structure of the interleaved data is depicted in Figure 6. An MPEG-2
transport MUX packet consists of 187 information bytes and one leading sync
byte (47 hex). The RS-encoder adds 16 bytes of redundancy to each packet. Every
eighth packet (super frame) is indicated by an inversion of the sync byte. On
the transmitter side, all data bytes except the sync bytes are scrambled prior
to RS-encoding. This structure of the data stream is exploited in the frame
synchronizer to perform

1) byte synchronization of the infinite bit stream

2) frame synchronization, which is needed to synchronize the deinterleaver and
the RS-decoder

3) resolving the π-ambiguity of the output data stream of the Viterbi decoder

Figure 6 DVB frame structure

The acquisition and tracking performance can be controlled via the IIC bus. It
depends on the bit error rate. For a typical parameter set and a BER of 2×10^-4
the mean time until detecting in-sync correctly is below 0.5ms and the mean
time until loss-of-sync is above 10^°s. 
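
A much simplified software view of the frame synchronizer is sketched below. It works on an already byte-aligned, error-free stream, looks for sync bytes spaced 204 bytes apart with exactly one inverted sync byte in eight packets, and treats the π-ambiguity as a bitwise inversion of the whole stream. All of these are assumptions made for illustration; the real unit searches a bit stream, tolerates bit errors, and uses programmable acquisition and tracking parameters.

```python
SYNC = 0x47          # MPEG-2 transport sync byte
SYNC_INV = 0xB8      # bitwise inverted sync byte (super frame marker)
PACKET_LEN = 204     # 1 sync + 187 information + 16 parity bytes


def find_packet_offset(data, packets_to_check=8):
    """Return (offset, index_of_marker, stream_inverted) or None.

    offset           : position of the first packet start found
    index_of_marker  : which of the checked packets carries the marker
    stream_inverted  : True if the whole stream appears bit-inverted
    """
    for offset in range(PACKET_LEN):
        last = offset + (packets_to_check - 1) * PACKET_LEN
        if last >= len(data):
            break
        heads = [data[offset + k * PACKET_LEN] for k in range(packets_to_check)]
        if not all(b in (SYNC, SYNC_INV) for b in heads):
            continue
        inverted_count = heads.count(SYNC_INV)
        if inverted_count == 1:
            return offset, heads.index(SYNC_INV), False
        if inverted_count == packets_to_check - 1:
            return offset, heads.index(SYNC), True
    return None


# Synthetic stream: 3 leading bytes, then 16 packets with all-zero payload.
packets = b"".join(
    bytes([SYNC_INV if p % 8 == 2 else SYNC]) + bytes(PACKET_LEN - 1)
    for p in range(16)
)
stream = bytes(3) + packets
print(find_packet_offset(stream))
print(find_packet_offset(bytes(b ^ 0xFF for b in stream)))
```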

The error protected packets of 204 bytes are interleaved in the transmitter. 
Therefore, a deinterleaver has to process the byte stream before the RS-decoder 
is able to decode the packets. 

In principle, the deinterleaver is a convolutional interleaver with I = 12
branches (Forney 1971, Ramsey 1970). Each branch consists of a shift register with
M * (11 - j) cells (M = 17, j branch index). Each register has a wordlength of
eight bits. The data are (de)interleaved byte-wise. For synchronization purposes,
the (inverted) sync bytes are always routed to branch "0" of the deinterleaver
(see Figure 7). Due to the large consumption of silicon real estate,
implementing the deinterleaver using register cells would be very inefficient. Instead, a
RAM-based solution was implemented. In order to obtain the minimal possible
memory size, an addressing scheme was developed that allows in-place updating.
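
The branch structure described above can be modelled in software with one FIFO per branch. The sketch below is a behavioural model only: it uses Python deques rather than the RAM-based in-place addressing scheme of the chip, and it assumes the usual complementary arrangement in which interleaver branch j holds M * j cells while deinterleaver branch j holds M * (11 - j) cells. The class and variable names are hypothetical.

```python
from collections import deque

I_BRANCHES = 12
M_CELLS = 17


class DelayBank:
    """Bytes are routed cyclically to branches; branch j delays its bytes
    by depths[j] cells (one cell shift per visit of that branch)."""

    def __init__(self, depths, fill=0):
        self.fifos = [deque([fill] * depth) for depth in depths]
        self.branch = 0

    def push(self, byte):
        fifo = self.fifos[self.branch]
        self.branch = (self.branch + 1) % len(self.fifos)
        if not fifo:              # zero-length branch: pass straight through
            return byte
        fifo.append(byte)
        return fifo.popleft()


interleaver = DelayBank([M_CELLS * j for j in range(I_BRANCHES)])
deinterleaver = DelayBank([M_CELLS * (I_BRANCHES - 1 - j) for j in range(I_BRANCHES)])

data = [(7 * n + 3) % 256 for n in range(6000)]
restored = [deinterleaver.push(interleaver.push(b)) for b in data]

# Every byte sees the same end-to-end delay of M * (I - 1) * I bytes.
DELAY = M_CELLS * (I_BRANCHES - 1) * I_BRANCHES
print("end-to-end delay:", DELAY, "bytes")
print("round trip ok:", restored[DELAY:] == data[:-DELAY])
```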




Figure 7 De-interleaving scheme 



5 REED SOLOMON DECODER 

The DVB standard specifies a shortened (204,188) Reed-Solomon (RS) code.
One codeword consists of 204 bytes, separated into 188 information bytes and
16 parity check bytes. Since errors-only decoding is employed (no erasure
processing), the RS decoder is able to detect and correct up to t = 8 byte errors per
codeword (a byte error denotes an erroneous byte, independent of the number of
corrupted bits), which can be arbitrarily distributed within the data and check
locations in a codeword. This code is designed to achieve QEF (quasi error free)
performance. The code is characterized by the code generator polynomial

$g(z) = \prod_{i=m_0}^{m_0+2t-1} (z - \alpha^i)$

with $m_0 = 0$ as specified in the DVB standard. The DVB Reed-Solomon (RS) decoder
uses a finite field GF(2^8) which is specified in the DVB standard by the field
generator polynomial $f(x) = x^8 + x^4 + x^3 + x^2 + 1$. For the DVB application the
"classical" method, given by syndrome calculation in the frequency domain and
calculation of the error locator and evaluator polynomials using the
Berlekamp-Massey algorithm, is considered to be optimal.

The whole decoding process, which has to be performed for each codeword, 
can then be coarsely divided into the following steps: 

• Syndrome calculation

• Calculation of the Error Locator and Evaluator Polynomials 

• Chien Search (Determination of the roots of the Error Locator Polynomial) 

• Calculation of the correction values 

• Correction and output of the codeword 

These steps are reflected in the top level structure which is shown in Figure 8.
Due to the high throughput requirements, every block is implemented as a
separate hardware unit.
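
The first of the steps listed above, the syndrome calculation, can be sketched in
software as follows. This is a behavioural illustration only, not the hardware
unit of the chip. It assumes the DVB field polynomial x^8 + x^4 + x^3 + x^2 + 1
(0x11D), the primitive element alpha = 2 and m0 = 0; the table-based arithmetic
and all function names are choices made for the example.

```python
PRIM = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1

# Exponent and logarithm tables for GF(2^8), alpha = 2.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for i in range(255):
    EXP[i] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= PRIM
for i in range(255, 510):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    """Multiply two field elements via the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def generator_poly(nsym=16, m0=0):
    """g(x) = product of (x - alpha^i), i = m0 .. m0+nsym-1, highest degree first."""
    g = [1]
    for i in range(m0, m0 + nsym):
        nxt = g + [0]                       # g(x) * x
        for k, coef in enumerate(g):
            nxt[k + 1] ^= gf_mul(coef, EXP[i])  # + g(x) * alpha^i
        g = nxt
    return g

def rs_encode(msg, nsym=16):
    """Systematic encoding: append the remainder of msg(x) * x^nsym mod g(x)."""
    g = generator_poly(nsym)
    rem = [0] * nsym
    for byte in msg:
        feedback = byte ^ rem[0]
        rem = rem[1:] + [0]
        for k in range(nsym):
            rem[k] ^= gf_mul(feedback, g[k + 1])
    return list(msg) + rem

def syndromes(word, nsym=16, m0=0):
    """Evaluate the received polynomial at alpha^m0 .. alpha^(m0+nsym-1) (Horner)."""
    result = []
    for i in range(m0, m0 + nsym):
        s = 0
        for byte in word:
            s = gf_mul(s, EXP[i]) ^ byte
        result.append(s)
    return result

message = list(range(188))          # 188 information bytes
codeword = rs_encode(message)       # 204 bytes, 16 of them parity
corrupted = codeword[:]
corrupted[10] ^= 0x5A               # one byte error

print(any(syndromes(codeword)))     # False: a valid codeword has all-zero syndromes
print(any(syndromes(corrupted)))    # True: the error shows up in the syndromes
```

The shortening from 255 to 204 bytes does not appear explicitly, because leading
zero bytes change neither the parity bytes nor the syndromes.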




Figure 8 Block diagram of RS decoding architecture 

Given a syndrome, a time budget of 204*4 = 816 clock cycles is available for
solving the key equation using the Berlekamp-Massey algorithm. In order to
minimize area consumption while meeting this throughput constraint, a special
ALU supporting Galois field arithmetic was developed (see Figure 9).




Figure 9 Galois Field ALU 

The polynomial coefficients are stored intermediately in two register files, one
for the error evaluator polynomial and one for the error locator polynomial Λ.
A large hard-wired state machine steers the operations in the ALU and the
register files. This design approach leads to a highly efficient implementation
of the Berlekamp-Massey algorithm, implementing exactly the amount of parallel
processing necessary to meet the given throughput constraint. The input data,
which are stored in a dual-port RAM (the codeword buffer), are finally read out
and corrected.
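
As an illustration of the key equation step, the following sketch runs a textbook
Berlekamp-Massey iteration on the syndromes of a hypothetical three-byte error
pattern and then locates the errors with a Chien search. It is a software model
under the same field assumptions as the DVB code (polynomial 0x11D, alpha = 2,
m0 = 0) and does not reproduce the ALU, the register files or the state machine
of the chip. Positions are counted as powers of x, so position 0 is the last
byte of the codeword.

```python
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
value = 1
for i in range(255):
    EXP[i] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= PRIM
for i in range(255, 510):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[255 - LOG[a]]

def berlekamp_massey(synd):
    """Return the error locator polynomial, lowest degree first."""
    n_syn = len(synd)
    lam = [1] + [0] * n_syn    # current locator
    prev = [1] + [0] * n_syn   # locator saved at the last length change
    L, m, b = 0, 1, 1
    for n in range(n_syn):
        d = synd[n]            # discrepancy
        for i in range(1, L + 1):
            d ^= gf_mul(lam[i], synd[n - i])
        if d == 0:
            m += 1
            continue
        scale = gf_mul(d, gf_inv(b))
        saved = lam[:]
        for i in range(n_syn + 1 - m):      # lam -= (d / b) * x^m * prev
            lam[i + m] ^= gf_mul(scale, prev[i])
        if 2 * L <= n:
            L, prev, b, m = n + 1 - L, saved, d, 1
        else:
            m += 1
    return lam[:L + 1]

def chien_search(lam, n=204):
    """Positions p for which alpha^(-p) is a root of the locator polynomial."""
    positions = []
    for p in range(n):
        x = EXP[(255 - p) % 255]
        acc = 0
        for coef in reversed(lam):
            acc = gf_mul(acc, x) ^ coef
        if acc == 0:
            positions.append(p)
    return positions

# Hypothetical error pattern: position (power of x) -> error value.
errors = {3: 0x11, 57: 0xA0, 200: 0x01}
synd = []
for i in range(16):
    s = 0
    for p, e in errors.items():
        s ^= gf_mul(e, EXP[(i * p) % 255])
    synd.append(s)

locator = berlekamp_massey(synd)
print("number of errors:", len(locator) - 1)
print("error positions:", chien_search(locator))
```

Since the all-zero word is a valid codeword, the syndromes of the error pattern
alone are the syndromes of a received word, which keeps the example short.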



6 DESIGN METHODOLOGY

All algorithm design and system performance evaluation was performed using
the system level design tool COSSAP (Synopsys 1996). The performance of the
frame, carrier and timing synchronizer was calculated partly in conjunction with
Matlab (The MathWorks 1994, Lambrette, Schmandt, Post & Meyr 1995). Only
the synchronizers for timing, carrier, node sync and frame start required the
specification of algorithms and the fixing of algorithm-specific parameters
(loop bandwidth, threshold values); designing the remaining blocks did not
require major algorithmic investigations.

For each of the building blocks the environment was modeled. System simulations
paved the way from floating point to quantized integer models. The VHDL
descriptions of the components were verified against the corresponding system
level COSSAP blocks using a coupling of the system simulator and the VHDL
simulator (Zepter 1993b, Zepter 1993a). Therefore, no VHDL testbench had to
be written in the course of this project. This led to significant savings in design
time compared to a more conventional HDL-based methodology.

COSSAP was well suited to modeling the dynamic, data-dependent data flow
imposed by the controlled decimation in the timing synchronizer. In hardware,
this dynamic data flow was realized using gated clocks. Corresponding to the
three different sample rates 1/Ts, 2/T, and 1/T (see Figure 2), there are three
different clock domains in the symbol synchronizer. (A fourth clock domain
drives the Viterbi decoder and the subsequent units.) For synchronization
between these domains, adjacent clocks are negated. That means that each
transition from 'low' to 'high' of the clock corresponding to 2/T occurs only
on a falling edge of the clock corresponding to 1/Ts. The COSSAP system model
takes this into account. Therefore, for the presented system a hierarchical
COSSAP model exists which is bit-true and cycle-true identical to the VHDL
model.

Using the COSSAP design flow, seamless design verification was possible
throughout all design stages. Synopsys' Design Compiler was used for logic
synthesis of the RTL VHDL code. Test patterns for the resulting gate level
netlist and even for post-production testing were also generated from COSSAP.



7 PERFORMANCE 

For a symbol rate of R = 33 MHz, frequency offsets of ±12.5% normalized to the
symbol rate and a typical parameter setting, the acquisition time of the carrier
synchronizer is below 20 ms and the acquisition time of the timing synchronizer
is below 2 ms. The mean acquisition time for the frame synchronizer is about
0.5 ms.

In order to assess the performance of the synchronizer, bounds must be
established for the performance of the synchronizers as well as for the overall
system.

The most important performance measure is the bit error ratio, which is
measured behind the Viterbi decoder that follows the synchronizer. Usually, an
ideal implementation reaches the theoretical bounds of the error ratio. Any
degradation from this bound is due to implementation effects like quantization
or clipping. A detailed performance analysis, which also relates the bit-true
synchronizer performance to the Cramér-Rao bound, can be found in (Lambrette et
al. 1996, Lambrette et al. 1997, Meyr et al. 1997).



[Plot: BER performance of the bit-true receiver; simulation results for code
rates R = 7/8 and R = 1/2; horizontal axis Eb/N0 in dB]

Figure 10 Bittrue Receiver Model Performance 

Figure 10 displays the resulting bit error ratio as a function of Eb/N0. In
the figure, the ideal implementation (no synchronization errors, no
implementation loss due to survivor path truncation or quantization) is compared
to the bit-true model of the receiver, which also includes impairments of the
A/D converter. The overall degradation is about 0.4 dB, leaving 0.6 dB of
implementation loss for the other analog circuitry.



8 IMPLEMENTATION 

The chip was implemented in a 0.5 µm CMOS technology with three metal
layers. The single supply voltage is 3.3 V. The power consumption amounts
to 1.2 W at a maximum sampling rate of the analog input values of 88 MHz.
The maximum output bit rate is up to 56 Mbit/s. Table 1 summarizes the relative
standard cell areas of the main components and the normalized silicon areas
including analog components and RAM.

Figure 11 shows a chip photograph. The data flow direction is from left to
right. In the upper left corner the two clock synthesizer PLLs for the
synchronizer and the channel decoder are located. Below these, the A/D
converters for the in-phase and quadrature components were placed. The memory
blocks in the middle are the RAMs of the survivor memory unit, which enclose
the Viterbi decoder. On the right hand side, the memories for the deinterleaver
(at the top) and the Reed-Solomon decoder (bottom) can be seen.







Table 1 Cell and chip areas

Component          Std. cell area    Silicon area
Synchronizer            32 %              9 %
Viterbi                 40 %             48 %
RS-decoder*             28 %             17 %
A/D-converter             -               4 %
Clock-PLL                 -               1 %
Pad frame                 -              21 %
Sum                    100 %            100 %

*incl. deinterleaver, frame sync, and descrambler




Figure 11 Chip photograph 



9 CONCLUSION 

The implementation of a single-chip timing and carrier synchronizer and channel
decoder for digital video broadcasting over satellite (DVB-S) was described. Due
to the digital timing and carrier synchronization, the number of external
components has been minimized. The chip is fully compliant with the DVB standard
and allows automatic acquisition of variable symbol rates and convolutional code
rates. The design methodology presented ensures both short time to market and
high design integrity.



Abstract

The paper presents a high performance ATM Switching Element with
programmable switching capacity from 8 x 8 to 32 x 32 links, each at 311 Mbit/s.
The 8 x 8 component is a single fully CMOS IC; the others can be obtained by
properly connecting the 8 x 8 basic elements in Multichip Modules. It is shown
how to connect them in order to avoid blocking problems, to increase the shared
buffer availability and to improve the global system capacity by a factor of 2
or 4. The expanded Switching Element is a single component in terms of
functionality and represents a single stage in the ATM Switching Fabric. An MCM
16 x 16 Switch is also described and its features are presented. Both
components find an industrial application in the Italtel Cross Connect UTXC,
which, using these elements, can provide systems from 5 up to 160 Gbit/s.



1 INTRODUCTION

The Asynchronous Transfer Mode (ATM) has been identified as the backbone
technique for transport, multiplexing and switching in the B-ISDN [Pry91]. ATM
systems are based on powerful integrated circuits both at the user level and at
the network management level. Nevertheless, in order to preserve investments,
flexibility and modularity must be given great weight both at the architectural
and at the IC level. System manufacturers must provide nodes with different
costs, specifications, dimensions and capacities according to customer needs.
Therefore the architectural solutions must be modular and easy to upgrade from
medium-low capacity (5 Gbit/s) to high capacity (several hundred Gbit/s). As a
consequence, the ICs should also be as flexible as possible in both
specifications and capabilities. The paper focuses on ATM switching nodes, and
mainly on the Switching Element (SE), which is the real core of any ATM
Switching Fabric (ASF) [Bal92]. The global switching capability and throughput
of the node depend strongly on the SE speed and capacity. The most common SE
speeds range from 155 Mbit/s to 622 Mbit/s per link, and the most common
switching capacities from 4 x 4 to 32 x 32. As specifications grow, however,
components are becoming more and more complex, sometimes reducing yield and
increasing cost and power dissipation. Moreover, working only on increasing the
switching capability forces a reduction of other features such as
controllability, debugging, flexibility and programmability, which from an
industrial viewpoint are sometimes even more important. The solution presented
here is based on an SE of medium complexity (8 x 8) with a maximum speed of
311 Mbit/s per link (BASE8). The technology is CMOS, the complexity is about
700,000 transistors, and the power dissipation is 2.3 W at the working
frequency. Using the MCM technique, the switching capacity of the component can
be increased from 8 x 8 to 32 x 32: the 8 x 8 SE contains specific logic,
programmable by the microprocessor, that allows it to be configured as a
sub-stage of a 16 x 16 or 32 x 32 element. Architectures and solutions are
proposed in order to increase the switching capability as required by the
system; each architecture realises a component that is seen as a single stage
by the system. An interesting application is the 16 x 16 SE (MCM-BASE16),
designed by CSELT and realised by IBM using the MCM-C technique. This
component, with its 5 Gbit/s throughput and packaged in a 1073-column CCGA,
makes it possible to double the switching capability of the complete system.
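
The throughput figures quoted above follow from the number of links and the link
rate. A minimal arithmetic sketch, assuming the exact link rate of 311.04 Mbit/s
and counting one direction only (the function name is made up for the example):

```python
LINK_RATE_MBIT_S = 311.04  # assumed exact value of the "311 Mbit/s" link rate

def se_throughput_gbit_s(n_links, link_rate_mbit_s=LINK_RATE_MBIT_S):
    """Aggregate throughput of an n x n switching element in Gbit/s."""
    return n_links * link_rate_mbit_s / 1000.0

for n in (8, 16, 32):
    print(f"{n} x {n}: {se_throughput_gbit_s(n):.2f} Gbit/s")
```

For n = 16 this gives just under 5 Gbit/s, which is consistent with the 5 Gbit/s
stated for the MCM-BASE16.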



2 THE SWITCHING ELEMENT FEATURES 

The N x N switching element can switch N input links onto N output links, with
N = K*8, using 8 x 8 basic elements, according to the detailed routing
information laid down in the header of the data stream. All N input/output
links have a 4-bit parallelism, each with its own nibble clock, which defines
the frequency of that input/output link, and a cell clock. The component
internally aligns the incoming cells with the internal master clock. The cells
can be switched to a virtual path or to a distributed path, with or without the
exclusion of one output link. Many features have been developed to avoid
traffic congestion and blocking problems. A cell loss priority principle is
implemented: when the number of stored cells exceeds a threshold, programmable
by the microprocessor, low priority cells are discarded and this is signalled
to the control microprocessor. If the back-pressure function has been
activated, a flag is sent to the previous stage, which will then send only
empty cells on that connection. Priority and back-pressure thresholds are
programmable. The Switching Element architecture corresponds to a model with
output queues, realised with the complete sharing of a large memory that is
properly addressed in order to obtain a spatial switching of the cells towards
the outputs. The outgoing cells can be sent out at different speeds,
programmable link by link.
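
The interplay of the two programmable thresholds can be pictured with a small
occupancy model. This is only a sketch of one plausible reading of the
description above: the class, its field names, the default values and the exact
comparison rules (strictly greater than the threshold) are assumptions and are
not taken from the device.

```python
from dataclasses import dataclass

@dataclass
class SharedBufferModel:
    """Occupancy model of a shared cell buffer with two thresholds (hypothetical)."""
    capacity: int = 128               # cells the buffer can hold (assumed)
    priority_threshold: int = 96      # above this, low priority cells are dropped
    backpressure_threshold: int = 112 # above this, the previous stage is throttled
    stored: int = 0
    discarded_low_priority: int = 0
    discarded_overflow: int = 0

    def offer(self, low_priority: bool) -> bool:
        """Try to store one incoming cell; return True if it was accepted."""
        if self.stored >= self.capacity:
            self.discarded_overflow += 1      # buffer overflow
            return False
        if low_priority and self.stored > self.priority_threshold:
            self.discarded_low_priority += 1  # cell loss priority rule
            return False
        self.stored += 1
        return True

    def release(self) -> None:
        """One cell leaves the buffer towards an output link."""
        if self.stored:
            self.stored -= 1

    @property
    def backpressure(self) -> bool:
        """Flag that would be sent to the previous stage."""
        return self.stored > self.backpressure_threshold

buf = SharedBufferModel(capacity=8, priority_threshold=4, backpressure_threshold=6)
for _ in range(5):
    buf.offer(low_priority=False)
print(buf.offer(low_priority=True), buf.backpressure, buf.stored)
```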

The component can be configured by an external microprocessor, which sets the
thresholds and the output link frequencies and forces empty cells during the
initialisation phase. The component is speed programmable up to 311 Mbit/s per
link, both on the input links, using the nibble clock, and on the output links.



3 THE 8x8 SWITCHING ELEMENT 

The Switching Element architecture corresponds to a model with output queues, 
realised with the complete sharing of a large memory properly addressed in order to 
obtain a spatial switching of the cells towards the output. 

The 8x8 SE* switches 8 input links (ILINK) onto 8 output links (OLINK) as 
shown in Figure 1 [Tur94]. Starting from the master clock MCK, externally 
received, the Phase Generator unit creates the internal time base (intClocks). 
Everything in the device is synchronous with the mentioned time base. 

In the DECLINK unit the component internally aligns the incoming cells with
the internal master clock. On reception of the cell clock signal, the input
data are registered in peripheral buffers, which hold half a cell and make it
possible to handle an input link speed that differs from the internal master
clock. Reading of the buffer is activated when one-fourth of a cell has been
synchronised. Besides synchronisation, its main function, the DECLINK unit
carries out the first checks on the cell: whether the cell has the proper
length (64 bytes), whether the routing tag is correct (parity checks), whether
the cell is an empty or a test cell, and whether it has to be discarded because
a threshold has been exceeded (priority or overflow on the Shared Buffer).



' Patent Granted 

The Tag Analysis & Address Generation unit properly addresses the large memory
(Shared Buffer), where the ATM cells are stored. The Rotation Memory adapts the
data from the external format (4 bits) to the internal memory format (128 bits)
and vice versa.

A Microprocessor Interface handles all information which has to be exchanged
between the SE and the external microprocessor, such as diagnostic information,
microprocessor commands, and operation and maintenance instructions. Test cells
can be extracted from and inserted into the network stream for debugging
purposes. The OUTLINK unit adapts the outgoing data from the internal frequency
to the programmable output link speed.




Figure 1. 8 x 8 Switching Element architecture. 

The goals of high speed and low power dissipation strongly drove the
technology and design choices. In order to achieve these objectives, a fully
CMOS technology (0.6 µm) was chosen and great care was taken in the electrical
design of critical parts. Thanks to the fully CMOS realisation and to the
design solutions adopted, the circuit stands out from other ATM switching
elements, designed in BiCMOS technology, for its power dissipation, for its
high working frequency, which implies a very high throughput, for its
architectural and design flexibility and, above all, for the system features
aimed at making the component an industrial product.

The 8 x 8 circuit works with an internal clock of 77.8 MHz (referring to the
UTXC application). The critical structures were designed full custom, while
Standard Cells were used for glue logic and non-critical parts. The I/O pads
and the internal memories were customised.

The circuit uses special, fully CMOS I/O pads^ with a reduced swing (0 V,
1.5 V). In CMOS circuits, dynamic power consumption is proportional to the
transition frequency, the capacitance and the square of the supply voltage;
reducing the supply voltage consequently provides a significant power saving.



^ Patent granted

The fast CMOS output driver can work at a rate of 200 Mbit/s with an eye
pattern that is more than 80 % open. The driver, which uses series termination,
is designed with an output impedance equal to the line impedance. The output
impedance can track the line resistance with an accuracy of 10 %. Thanks to the
series termination approach (power is dissipated only during transitions), the
power consumption is about 30 mW at 100 MHz.

The Shared Memory is a synchronous static RAM (512 words x 128 bits); each
word stores one-fourth of an ATM cell, so that a complete cell is written into
four locations of the RAM with contiguous memory addresses. This represents the
best trade-off between word size and memory access time constraints for the
physical realisation of the system. Particular electrical solutions have been
implemented in order to reduce the power dissipation due to the 128 I/Os. The
bit-line swing was reduced, the SRAM is clocked only when it is working, and a
latching Sense Amplifier^ (Figure 2) with minimum power dissipation has been
designed. Its input works with a reduced swing of 250 mV and a common mode
voltage of more than 4.5 V. A specific structure automatically turns off the
sense amplifier biasing, and hence its static power dissipation; the amplifier
settles in about 2.1 ns at the worst corner of the process.
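
The memory organisation described above, together with the 4-bit to 128-bit
format adaptation performed by the Rotation Memory, can be sketched as simple
address and packing arithmetic. The sketch assumes the 64-byte internal cell
mentioned for the DECLINK checks; the slot numbering, the nibble order (first
nibble in the most significant position) and all names are assumptions, and the
two-plane shift register structure of the real Rotation Memory is not modelled.

```python
CELL_BYTES = 64      # internal cell length
WORD_BITS = 128      # Shared Memory word width
RAM_WORDS = 512      # Shared Memory depth
NIBBLE_BITS = 4      # link parallelism

WORDS_PER_CELL = CELL_BYTES * 8 // WORD_BITS   # 4 words per cell
CELL_SLOTS = RAM_WORDS // WORDS_PER_CELL       # cells that fit into the RAM
NIBBLES_PER_WORD = WORD_BITS // NIBBLE_BITS    # 32 nibbles per word

def word_addresses(slot):
    """The four contiguous RAM addresses that hold the cell in a given slot."""
    if not 0 <= slot < CELL_SLOTS:
        raise ValueError("slot out of range")
    base = slot * WORDS_PER_CELL
    return [base + k for k in range(WORDS_PER_CELL)]

def nibbles_to_words(nibbles):
    """Pack one cell, received nibble by nibble, into 128-bit words."""
    if len(nibbles) != WORDS_PER_CELL * NIBBLES_PER_WORD:
        raise ValueError("a cell consists of 128 nibbles")
    words = []
    for w in range(WORDS_PER_CELL):
        word = 0
        for nib in nibbles[w * NIBBLES_PER_WORD:(w + 1) * NIBBLES_PER_WORD]:
            word = (word << NIBBLE_BITS) | (nib & 0xF)
        words.append(word)
    return words

def words_to_nibbles(words):
    """Inverse operation for the output side."""
    nibbles = []
    for word in words:
        for k in reversed(range(NIBBLES_PER_WORD)):
            nibbles.append((word >> (NIBBLE_BITS * k)) & 0xF)
    return nibbles

cell = [n % 16 for n in range(128)]
print(CELL_SLOTS, word_addresses(5))
print(words_to_nibbles(nibbles_to_words(cell)) == cell)
```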




Figure 2. Sense Amplifier Power Dissipation and layout. 

Thanks to these solutions the global power dissipation has been limited to 2.3 W
at 77.8 MHz, which has been a key point for the successful use of the component
in the MCM technology.



^Patent granted 







The Rotation Memory is a large two-fold shift register structure, implemented as
a stack of dynamic shift registers split into 2 planes. To guarantee a high
working frequency, the wiring of the 256 inputs/outputs is kept short by fitting
the layout of this module to the pitch of the SRAM.

Particular synchronisation and buffering structures were implemented, using
local latching techniques and balanced wire lengths, in order to avoid skew
among clocks and control signals.



4 THE SWITCHING CAPACITY EXPANSION 

An 8 x 8 SE can be used to implement a non-blocking 16 x 16 or 32 x 32
single-stage architecture by using the Multichip module technology.

The routing information is extracted by each stage by reading a proper Routing
Byte (one for each network stage). In this byte, five bits (called TAG)
identify the output link and two bits specify the Routing Mode (Virtual,
Distribution, Distribution with Exclusion). The single-stage architecture is
organised in sub-stages of 8 x 8 SEs, and each sub-stage uses a three-bit
subset (called SUBTAG) extracted from the same TAG.

Using this technique^, a 16 x 16 architecture with four SEs organised in two
sub-stages (see Figure 3) and a 32 x 32 architecture with twelve SEs organised
in three sub-stages can be realised. More complex architectures can be realised
simply by increasing the number of sub-stages. Using a single TAG for any
single-stage architecture is the real key point.



[Diagram: four 8 x 8 SEs in two sub-stages; inputs ILINK 0 to ILINK 15,
outputs OLINK 0 to OLINK 15]



Figure 3. 16 x 16 Single Stage Architecture. 

Each base element extracts its own 3-bit SUBTAG from the TAG, shifting by one
bit position from one sub-stage to the next.

Two routing modes can be set by the microprocessor: Homogeneous Routing 
and Mixed Routing Mode. 



^ Patent pending 

In the Homogeneous Routing Mode every sub-stage switches with the same
Routing Mode indicated in the Routing Byte. For example, in the 16 x 16
architecture an ATM cell with Virtual Routing Mode and TAG = 01101 is routed
by the first sub-stage according to SUBTAG = 101 (out 5) and by the second
sub-stage according to SUBTAG = 110 (out 6); the cell is finally routed to link
OLINK 13.

In the Mixed Routing Mode only the last sub-stage switches with the Routing
Mode indicated in the Routing Byte, whereas all the other sub-stages switch in
Distribution Mode towards an odd or even output, using the least significant
bit of their SUBTAG. Referring to the previous example, an ATM cell with Virtual
Routing Mode and TAG = 01101 is routed by the first sub-stage according to
SUBTAG = XX1 (odd out) and by the second sub-stage according to
SUBTAG = 110 (out 6); the cell is finally routed to link OLINK 13.
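
The two worked examples above can be reproduced with a few lines of bit
manipulation. The sketch assumes that sub-stage s (counting from 0) uses the
three TAG bits starting at bit position s, which matches the 16 x 16 example; the
extension to three sub-stages and the function names are assumptions made for
the illustration.

```python
SUBTAG_MASK = 0b111

def subtag(tag, substage):
    """3-bit SUBTAG seen by a sub-stage (0 = first), shifting by one bit each."""
    return (tag >> substage) & SUBTAG_MASK

def route(tag, n_substages, mixed=False):
    """Per-sub-stage decisions: an output number, or 'odd'/'even' when the
    sub-stage distributes the cell in Mixed Routing Mode."""
    decisions = []
    for s in range(n_substages):
        st = subtag(tag, s)
        is_last = (s == n_substages - 1)
        if mixed and not is_last:
            decisions.append("odd" if st & 1 else "even")
        else:
            decisions.append(st)
    return decisions

tag = 0b01101
print(route(tag, 2))              # Homogeneous Routing Mode, 16 x 16
print(route(tag, 2, mixed=True))  # Mixed Routing Mode, 16 x 16
print(tag)                        # output link number addressed by the TAG
```

In both modes the cell leaves on the link whose number equals the TAG, here
OLINK 13, because the last sub-stage selects the upper bits and the earlier
sub-stages only fix the parity of the path.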



5 THE 16 X 16 SWITCHING ELEMENT 

The 16 x 16 SE is realised with the Multichip Module technique using the
architecture and the solutions previously described.




Figure 4. MCM 16 x 16 Switching Element substrate top view. 

Electrical and thermal simulations were performed in order to analyse and verify
signal integrity, simultaneous switching noise (SSN), cross-talk, chip temperatures
and power dissipation. The Multichip module is an MCM-C implemented with the
Ceramic Column Grid Array (CCGA) technology by IBM, with a substrate made
of Alumina and Molybdenum (size 43.75 x 43.75 mm) and a package consisting
of 1073 columns for surface mounting technology [Lic96].

The substrate top view is shown in Figure 4. The clock buffer is located in the
centre in order to allow balanced lengths of the clock wires. Each BASE8 die
has 280 pads distributed on two peripheral rings for wire-bonding; 16
decoupling capacitors and 8 resistors are distributed on the substrate with
surface mounting technology. Many thermal vias are placed inside the die area
in order to provide thermal dissipation towards the columns through the
substrate.

In terms of signal integrity, the electrical simulations confirmed quite good
interconnections for clock, synchronism and data signals: this is the great
advantage of the Multichip module solution compared with one using four
packaged BASE8 components. In fact, in the MCM-BASE16 the interconnections
between the four 8 x 8 components are not subject to the heavy degradation due
to the package. The SSN effect was analysed in the worst case, when all signals
but one are switching: the maximum peak-to-peak swing due to SSN is less than
5 % of the complete swing (1.5 V), see Figure 5.




Figure 5. SSN on a BASE8 output link.

The cross-talk analysis revealed no problems. The external interconnections
(from die pads towards I/O pins) were designed to be as short as possible in
order to minimise impedance mismatch.

The substrate cross-section is organised in 11 layers: 5 signal layers for routing
and redistribution (lines with 50 Ω impedance), 4 power supply layers
(implemented as meshed planes), and the top and bottom layers for die attach and
column location.

The package is an array of 33 x 33 pins (335 for signals, 442 for GND and
296 for the power supplies). These pins are columns of Pb/Sn with
1.27 mm pitch and 2.2 mm height for surface mounting technology.

Due to the high power dissipation density (2.3 W for each BASE8 and 0.4 W for
the clock buffer) and the demanding thermal constraints (free convection at
0.25 m/s and ambient air temperature of 40 °C), a Thermalloy aluminium
heat-sink is attached over the non-hermetic lid of the substrate in order to
provide thermal dissipation; moreover, the large number of power columns, used
as thermal columns, improves dissipation from the MCM towards the board.
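
The package figures given in this section can be cross-checked with simple
arithmetic. The sketch below adds up the column counts and the power of the four
dies and the clock buffer. The last line derives a single lumped
junction-to-ambient thermal resistance from the 100 °C junction limit mentioned
for the thermal simulation; this is a rough simplification introduced here, not
a result of the 3D simulation.

```python
# Column budget of the 33 x 33 array
signal_columns, gnd_columns, power_columns = 335, 442, 296
grid_positions = 33 * 33
columns = signal_columns + gnd_columns + power_columns

# Power budget of the module
n_dies, p_die_w, p_clock_buffer_w = 4, 2.3, 0.4
total_power_w = n_dies * p_die_w + p_clock_buffer_w

# Lumped thermal budget (simplifying assumption: one thermal resistance
# between the junctions and the ambient air)
t_junction_max_c, t_ambient_c = 100.0, 40.0
theta_ja_max = (t_junction_max_c - t_ambient_c) / total_power_w

print("columns:", columns, "of", grid_positions, "grid positions")
print(f"total power: {total_power_w:.1f} W")
print(f"largest admissible lumped thermal resistance: {theta_ja_max:.2f} K/W")
```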

The 3D thermal simulation shows that a forced air flow at 1 m/s is strictly 
recommended in order to guarantee a junction temperature less than 100 °C. In 
terms of boards both the 8x8 and the 16 x 16 (Figure 6) are single board 
solutions. Complete compatibility is guaranteed in terms of connections, controls 
and back-plane insertion. 




Figure 6. 16 x 16 Switching Board. 



6 UTXC SYSTEM APPLICATION 

The UTXC reference functional architecture reflects a partitioning that allows a 
simple and flexible node sizing. In fact it can grow from 5 Gbit/s to 160 Gbit/s by 
adding modules [Col94, Bal95]. 

The overall functionality of the ATM switching system, described above, is 
realised by connecting a limited number of basic modular chips, with the only 
addition of external SRAMs, one or more microcontrollers and commercial line 
driver ECL buffers. 

The circuits, designed by CSELT and Italtel, are inserted in the ATM Switching 
architecture of the Italtel UTXC node (Figure 7). The peripheral part (Exchange 
Termination), representing the ATM interface of the Broadband Termination Unit 
(BTU) acts on both the incoming and outgoing cells, the CHP and DPRC 
components being bi-directional. 

The input ATM cell flow is subjected to UNI/NNI ATM Layer operations,
including Usage/Network Parameter Control and VPI/VCI switching, fully
complying with the ITU-TS Recs. Then, the ATM cell format is transformed into
the proprietary one, adding the internal information used inside the system for
routing and handling the cell, while adapting the external bit rate (i.e.,
155.52 Mbit/s) to the internal one (i.e., 311.04 Mbit/s, due to the different
cell format and internal speed advantage). The proprietary cell is then
duplicated, after having been marked with a time stamp value, and each copy is
sent independently to the corresponding part of the ASF, that is, one of the two
identical parts forming the duplicated ASF. Within the ASF, the cell is routed
according to a random distribution algorithm to the ASF output port identified
by the routing fields contained in the proprietary cell header.
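
The relation between the external and the internal bit rate can be made concrete
with a short calculation. It assumes the standard 53-byte external ATM cell and
the 64-byte internal cell length checked by the DECLINK unit, and it ignores any
transmission overhead on the external line, so the resulting speed advantage is
only an estimate.

```python
EXTERNAL_RATE_MBIT_S = 155.52
INTERNAL_RATE_MBIT_S = 311.04
EXTERNAL_CELL_BYTES = 53   # standard ATM cell (assumption for the external format)
INTERNAL_CELL_BYTES = 64   # proprietary internal cell

# Rate needed to carry the same number of cells per second in the larger format
required_internal_rate = (
    EXTERNAL_RATE_MBIT_S * INTERNAL_CELL_BYTES / EXTERNAL_CELL_BYTES
)

# Remaining margin of the internal links over that rate
speed_advantage = INTERNAL_RATE_MBIT_S / required_internal_rate

print(f"rate needed for the internal format: {required_internal_rate:.1f} Mbit/s")
print(f"internal speed advantage: {speed_advantage:.3f}")
```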

Figure 7 shows the complete 160 Gbit/s configuration, using 2 stages of 8 x 8 SE
in the ATM Peripheral Modules (APM) and 3 stages of 16 x 16 SE in the central
stages; the first stages are routed in distribution mode and the others in
virtual mode, in order to send the cells to the required connection.




Figure 7. UTXC Switching Network. 

The 10 Gbit/s and 20 Gbit/s are obtained with 3 stages using respectively the 
8x8 or the 16 x 16 SE in the central stage. After the ASF output port the 
duplicated cell is properly selected and the cell sequence is rebuilt thanks to a 
maximum delay equalisation mechanism. Finally, the internal cell format is 
transformed into the ATM external one by removing the proprietary fields, while 
adapting the speed to the external one. 



7 CONCLUSIONS 

The tight ATM requirements in terms of both speed and flexibility demand a very
good mastery of technology at every level (IC, packaging techniques, boards,
mechanics, interconnections and so on). An integrated design methodology is
required to provide good industrial systems. The described ATM Switching
Element is a clear example of how system requirements, IC design, new packaging
solutions and architectural issues can be merged in order to provide efficient
and modular solutions for the highly demanding telecom business.

The availability of components with different capacities and costs (8 x 8,
16 x 16, ...) fits the different application requirements well in terms of
system cost, integration and global throughput.



Abstract

The yield of VLSI processors with on-chip cache can be enhanced considerably by 
tolerating cache defects. It has been shown that the performance degradation due
to disabling the faulty blocks is small for set-associative caches, while in the
case of direct-mapped caches it may be substantial. In this paper we present a
reconfigurable cache capable of operating either as direct-mapped (DM) or as
two-way set-associative (TW). In this way, VLSI processor chips with defective
cache blocks are not discarded, which enhances the yield, and they are used in
the operation mode that minimises the performance degradation. Trace driven
simulation has been used to determine the minimum number of faulty blocks after 
which the TW operation mode is more profitable. This minimum value depends 
on cache size, block size, the access time of the cache and the miss penalty time. 
For computing the access time of the caches, an analytical access time model for 
on-chip caches already proposed in the open literature has been used. 



1 INTRODUCTION 

High-performance single-chip VLSI processors make extensive use of on-chip
cache memories to sustain the memory bandwidth demands of the CPU
(Shoemaker-1990, Intel-1992, Mirapuri-1992, Motorola-1993, Edmodson-1995).
The area that these on-chip caches occupy is a significant portion of the total
chip area (Saxena-1995) and is expected to grow further in the future. As the
chip area devoted to the on-chip cache increases, a significant portion of the
manufacturing defects will occur in the cache portion. If these defects can be
tolerated without a substantial performance loss, then the yield of VLSI
processors with on-chip cache can be enhanced considerably.

A technique for tolerating defects is the use of redundancy (Moore-1986,
Sohi-1989, Pour-1993). A form of redundancy in cache memories is the use of
spare blocks. After production test the defective cache blocks are substituted
by spare blocks using electrical or laser fuses. A similar technique uses spare
word and/or bit lines that are selected instead of faulty ones. This technique
is used, for example, in the caches of the MIPS R4000 processor
(Mirapuri-1992). The above redundancy techniques impose an area overhead for
accommodating the spare circuitry and the logic needed to implement the
reconfiguration. Redundancy can also take the form of extra bits per word for
storing an error correcting code. The classical application of a Single Error
Correcting and Double Error Detecting (SEC-DED) Hamming code in the on-chip
cache was investigated by Sohi (1989) and was found to be an unattractive
option for yield enhancement of high-performance VLSI processors. In
(Vergos-1995a) it was shown that defects in the tag store of a cache have more
serious consequences for the integrity and performance of a system than similar
defects in the data store; to this end a new SEC-DED code exploitation method
was introduced. Unfortunately, this technique is only capable of masking single
errors per word, such as the errors caused by a bit line defect.

Cache memory is by itself a redundant module in the sense that it is not
necessary for the correct operation of the processor; it affects only the
performance. Thus, a possible technique to tolerate defects in cache memories is
the disabling of the faulty cache blocks. This technique was investigated in
(Sohi-1989, Pour-1993) and it was found that the mean relative miss ratio
increase due to disabling the defective cache blocks decreases with increasing
cache size and is negligible for a very small number of faulty blocks unless a
set is completely disabled. Unfortunately, the number of faulty blocks can be
large. A very small number of random spot defects in the tag part of a cache
memory can affect a large number of tags (Vergos-1995a), leading to the
disabling of a large number of blocks. Also, because of the clustering of
defects (Koren-1989) and the fact that the on-chip cache is a large portion of
the total chip area, it is possible for a large number of defects to appear in
the cache while all the critical resources of the chip are defect free. Besides
the above, in direct-mapped (DM) caches a set consists of just one block and
thus disabling a faulty block results in the disabling of a set. Therefore, in
DM caches disabling even a very small number of faulty cache blocks results in
substantial performance degradation. Usually the access time of the first level
on-chip cache determines the cycle time of high-performance VLSI processors.
Taking into account that DM caches offer smaller access times than their
set-associative (SA) counterparts (Hill-1988, Wada-1992, Wilton-1994), the use
of DM caches is advantageous in these cases. Moreover, in many cases DM caches
offer smaller average access times than SA ones for sufficiently large sizes
(Hill-1988). Thus, as the size of the on-chip caches increases, the first level
on-chip caches of future systems are expected to be DM (Jouppi-1990). The
performance recovery in faulty DM caches via the use of a very small fully
associative spare cache was investigated in (Vergos-1995b). The method proposed
in this paper is in many cases superior to the above technique with respect to
the average access time, which is a good metric of the memory hierarchy
performance (Patterson-1990).

As we have already mentioned, the miss ratio increase of SA caches due to 
disabling a number of faulty blocks is smaller than that of the corresponding DM 
caches. Therefore, in the cases where DM caches have smaller average access times 
in their fault free operation, we expect that when the number of disabled faulty 
blocks exceeds a specific number, depending on the cache and block size and the 
miss penalty time, an SA cache will have a smaller average access time. If the 
above reasoning is valid, then an on-chip reconfigurable cache, capable of 
operating either in DM or in two-way set-associative (TW) mode, will be very 
attractive. The number of faulty blocks in the on-chip cache of a VLSI processor 
can be determined during testing. Thus, the chip will be used in an application 
with the cache operating in the mode (DM or TW) that offers the best average 
access time for the given configuration of the system. In this way we achieve two 
goals: VLSI processor chips with defective cache blocks are not discarded, which 
enhances yield, and they are used in a way that minimises the consequences of 
the defects on system performance. 

In this paper we give the design of a reconfigurable cache with DM and TW 
mode of operation. However, to show that this design is not meaningless we have 
to answer the following question. How much longer (if any) is the access time of a 
reconfigurable cache operating in DM mode compared to a non reconfigurable 
optimal, with respect to access time, DM cache? The answer to this question is 
crucial, because if the access time of a reconfigurable cache operating in DM mode 
is longer than that of the optimal non reconfigurable TW cache with the same 
cache and block sizes, then the non reconfigurable TW cache can be used more 
efficiently instead of the corresponding reconfigurable cache. Using a well- 
established analytical access time model for on-chip caches (Wilton-1994) we 
show that for all practical cases (caches with size greater than 4 KB) the access 
time of the reconfigurable cache operating in DM mode is equal to that of the 
optimal non reconfigurable DM cache. In the remaining cases the imposed delay is 
very small and the access time of the reconfigurable cache operating in DM mode 
always remains smaller than that of a non reconfigurable TW cache. Using trace 
driven simulation we show that when the number of disabled faulty blocks exceeds 
a minimum number F, the reconfigurable cache operating in TW mode provides a 
smaller average access time than the DM mode of operation. Also, using trace 
driven simulation we reveal the dependence of the value of F on the cache and 
block sizes as well as the miss penalty time. 

2 DESIGN OF A RECONFIGURABLE CACHE 

In DM caches each block of main memory can be placed in one specific frame of 
the cache memory while in TW caches it can be placed in any of two specific 
frames of the cache. The address used to access a DM or a TW cache is 
considered to consist of three parts <tag, index, word>. When both caches have 
the same size and block size the length of the word part is the same while the index 
part in the case of the TW cache is shorter by one bit than that of the DM cache. 
Of course, the opposite occurs with respect to the tag part. Figure 1 presents a 
block diagram of a reconfigurable cache that can operate in either DM or TW 
mode. The mode-select signal is used for selecting the required mode of operation 
and corresponds to a pin of the chip. According to the above, in a reconfigurable 
cache capable of operating in either DM or TW mode an address bit, denoted by X in 
Figure 1, will be used as a part of the index when the cache operates in DM mode 
and as a part of the tag in the other case. 
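
The address split described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `split_address` is hypothetical, cache and block sizes are assumed to be powers of two, and bit X is assumed to be the most significant index bit in DM mode, which becomes the least significant tag bit in TW mode.

```python
def split_address(addr, cache_bytes, block_bytes, ways):
    """Split a byte address into (tag, index, word offset).

    Assumes cache_bytes and block_bytes are powers of two.
    ways=1 models the DM mode, ways=2 the TW mode.
    """
    offset_bits = block_bytes.bit_length() - 1      # log2(block size)
    sets = cache_bytes // (block_bytes * ways)      # number of sets
    index_bits = sets.bit_length() - 1              # log2(number of sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Hypothetical example: 8 KB cache, 16-byte blocks.
addr = 0x13345
dm = split_address(addr, 8192, 16, ways=1)   # 9-bit index
tw = split_address(addr, 8192, 16, ways=2)   # 8-bit index, tag one bit longer
x_bit = dm[1] >> 8                           # bit X: top index bit in DM mode
```

For this address the TW tag equals the DM tag shifted left by one position with X appended, while the TW index is the DM index without X.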

2.1 DM mode of operation. 

In this mode of operation all multiplexers (MUXs) in Figure 1 permit signal X to 
pass at their outputs. 

We will examine what happens during an access for reading. The Data Address 
is used to address both data banks simultaneously. Also the Tag Address is used to 
address both tag banks simultaneously. Although both data banks are accessed in 
parallel, only one of the output buffers (selected by the value of X bit) is permitted 
to place its contents when they become available on the Data Out bus. Also only 
one of the two tags (selected by the value of X bit) is routed to the selected 
comparator. Only the output of the selected comparator can then affect the 
Hit/Miss signal. The multiplexers are not on the critical path of the cache, thus no 
delay is imposed on the access time of the cache over the corresponding non 
reconfigurable DM cache’s access time with the same layout. 

Depending on the cache and block sizes either the Hit/Miss signal generation 
path or the data access path may be the critical path of the cache. The processor 
can use the data even before the generation of the Hit/Miss signal if this results in 
better access time (Chang-1987). In this case the processor must be equipped with 
rollback capabilities that are used when a miss is discovered. Similar actions must 
take place for writing accesses. Modification of data, though, cannot begin before 
the generation of the Hit/Miss signal is complete. 





Figure 1. Block diagram of a reconfigurable cache that can operate in either DM or TW mode. 

2.2 TW mode of operation. 

In this mode of operation all MUXs block signal X from passing to their outputs. 

Again we will examine what happens during an access for reading. The Tag 
Address is used for addressing both tag banks in parallel. Also the Data Address is 
used to address both data banks. The two accessed tags are driven to the 
corresponding comparators for parallel comparison with the incoming tag. 
Although the data and tag banks are accessed in parallel, the Data Out bus can not 
be driven before the completion of the tag comparison procedure. Only one 
comparator can detect an equality (hit) and this will affect the Hit/Miss signal and 
will enable the corresponding Output Data Buffer to place its contents on the Data 
Out bus. We can see that in this case the MUX can be on the critical path, thus the 
access time of the reconfigurable cache operating in the TW mode may be longer 
than the access time of a non reconfigurable TW cache with the same layout by at 
most the delay of a MUX. The MUX will be on the critical path of the cache if: 

$t_{tag} + t_{comp} + t_{mux} > t_{data}$ 

where $t_{tag}$ and $t_{data}$ are the access times of the Tag and Data Banks 
respectively, and $t_{comp}$ and $t_{mux}$ are the delays of the comparator and the 
multiplexer respectively. In this case the imposed delay on the access time of the 
cache will be equal to: 

$\min\{t_{mux},\ t_{tag} + t_{comp} + t_{mux} - t_{data}\}$. 
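
The critical-path condition above can be written as a small helper. This is a minimal sketch: the function name `tw_mux_penalty` and the bank and comparator delays in the example are hypothetical, and only the 0.35 ns MUX delay is a value used in this paper.

```python
def tw_mux_penalty(t_tag, t_comp, t_mux, t_data):
    """Extra access time (same unit as the inputs) imposed by the MUX in TW mode.

    The MUX lies on the critical path only when the tag path, including the
    MUX, is slower than the data path; the penalty never exceeds t_mux.
    """
    slack = t_tag + t_comp + t_mux - t_data
    if slack <= 0:
        return 0.0              # data path dominates, the MUX is hidden
    return min(t_mux, slack)


# Hypothetical delays in ns; 0.35 ns is the MUX delay assumed in this paper.
hidden = tw_mux_penalty(3.0, 1.0, 0.35, 5.0)    # data path is the critical path
partial = tw_mux_penalty(3.0, 1.0, 0.35, 4.2)   # MUX only partly exposed
full = tw_mux_penalty(4.0, 1.0, 0.35, 4.2)      # whole MUX delay is exposed
```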

SA caches that allow the required word of the most recently used block in the 
selected set to appear on the data out lines before the tag comparison is complete 
(Chang-1987) are outside the scope of this paper. Updating can be done in parallel 
with sending the required data to the microprocessor and no additional delay is 
imposed on the access time of the cache over that of the corresponding non 
reconfigurable TW cache with the same layout. 

2.3 Time and hardware overhead of the reconfigurable cache. 

In this section we will investigate the time and hardware overhead of the 
reconfigurable caches with respect to the corresponding non reconfigurable DM 
caches. We will firstly consider the cache access time overheads. 

Analytical models are the best way to evaluate trade-offs among various 
alternatives without the cost of implementing each different alternative. To the 
best of our knowledge, two analytical models (Wada-1992, Wilton-1994) have 
been proposed for computing the access time of an on-chip cache that take into 
account the layout parameters along with the organisational parameters (cache 
size, associativity and block size). Both models have been validated by 
comparison with a Spice model. The model presented in (Wilton-1994) is more 
accurate, since it extends the work of (Wada-1992) by the inclusion of an 
additional array organisational parameter, improved decoder and word line 
models, precharged and column multiplexed bit lines and a tag array model with 
comparator and multiplexer drivers. Thus, the model presented in (Wilton-1994) is 
used in this work. 



Table 1 The number of segments per data bit line (Ndbl) and per tag bit line 
(Ntbl) in the optimal DM caches 

              Block = 8 Bytes    Block = 16 Bytes   Block = 32 Bytes
Cache Size    Ndbl    Ntbl       Ndbl    Ntbl       Ndbl    Ntbl
8 KB          2       2          2       2          2       2
16 KB         2       2          2       2          2       2
32 KB         4       4          2       2          2       2
64 KB         4       4          4       4          4       2
128 KB        8       4          4       4          4       4
256 KB        8       8          8       4          4       4



As can be seen from Figure 1 each of the tag and data stores of the proposed 
reconfigurable cache must consist of at least two banks (two segments per tag and 
data bit line). Table 1 presents the number of segments per tag bit line (Ntbl) and 
per data bit line (Ndbl) of an optimal, with respect to access time, DM cache, 
according to the analytical time model presented in (Wilton-1994). Cache sizes up 
to 256 KB and block sizes of 8, 16 and 32 bytes are considered. 

From Table 1 we can see that for cache sizes greater than or equal to 8 KB the 
optimal non reconfigurable DM cache consists of two or more segments per tag 
and data bit line. In these cases the access time of the reconfigurable cache 
operating in DM mode is the same as the access time of the corresponding 
optimal non reconfigurable DM cache. Taking into account that modern 
single chip VLSI processors usually offer on-chip caches greater than or equal to 8 
KB, we conclude that for cache sizes of practical interest the access time of the 
reconfigurable cache operating in DM mode is the same as the access time of the 
corresponding non reconfigurable DM cache. 

Using the analytical time model (Wilton-1994) we verified that the access time of 
the large as well as the small reconfigurable caches operating in DM mode is 
shorter than that of the optimal TW caches with the same capacity and block size. 

The hardware overhead of the reconfigurable cache with respect to a non 
reconfigurable DM cache with the same cache and block size consists of: 
a) Four 2->1 MUXs as shown in Figure 1. 

b) An extra tag comparator. 

c) Three extra bits per tag word. These bits are used as follows: 

• The disabling of the faulty blocks can be achieved by the addition of a second 
valid (availability) bit. The availability bits can be implemented by either non 
volatile memory cells or by normal static RAM. In the latter case the silicon 
area required is smaller than that needed by the non volatile memory 
implementation. However, each time the system is powered up, the availability 
information must be loaded into the availability bits from a safe copy. 

• As we have previously discussed, the tag of the reconfigurable cache when 
operating in TW mode is one bit longer than the tag in the DM mode of 
operation. Hence, an extra tag bit is required for storing the longer tag. 
This bit is not used in the DM mode of operation. 

• A bit per tag word is used for the implementation of Least Recently Used 
(LRU) policy in the TW mode of operation. 

To estimate the above area overhead we used the area model presented in 
(Mulder-1991). Applying the model, we found that the reconfigurable 4 KB and 
32 KB caches with 16-byte blocks require only 1.87% and 1.88% more area, 
respectively, than the corresponding optimal non reconfigurable DM caches. The 
area overhead is therefore very small. (In these calculations we assumed that the 
availability bits are implemented in static RAM.) 
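
As a rough cross-check of the order of magnitude, the three extra bits per tag word can be counted against the storage bits of a DM cache. The sketch below is not the area model of (Mulder-1991): it counts bits only and ignores the MUXs, the extra comparator and all layout effects. The function name `storage_overhead` and the 32-bit address width are assumptions made for illustration.

```python
def storage_overhead(cache_bytes, block_bytes, addr_bits=32, extra_bits=3):
    """Fraction of additional storage bits caused by extra_bits per tag word.

    Baseline per block: data bits + DM tag bits + one valid bit.
    Assumes power-of-two sizes and an addr_bits-wide physical address.
    """
    blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1
    index_bits = blocks.bit_length() - 1
    dm_tag_bits = addr_bits - index_bits - offset_bits
    base_bits_per_block = 8 * block_bytes + dm_tag_bits + 1
    return extra_bits / base_bits_per_block


overhead_4k = storage_overhead(4 * 1024, 16)     # 3 extra bits on 149 per block
overhead_32k = storage_overhead(32 * 1024, 16)   # 3 extra bits on 146 per block
```

Under these assumptions the bit-count overhead is about 2% for both sizes, the same order as the area figures quoted above.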



3 EVALUATION OF THE RECONFIGURABLE CACHE 

In order to compare the alternative cache configurations the average memory 
access time will be used, which is a good metric of memory hierarchy performance 
(Patterson-1990). For the average access time we have: 

$T = t_{cache} + m \cdot T_M$ 

where $t_{cache}$ is the access time of the cache, $T_M$ is the miss penalty time 
and $m$ is the miss ratio of the cache. To this end, we need to determine the miss ratios 
when either none or some of the cache’s blocks are disabled and the access time of 
each cache configuration. 
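
The comparison between the two modes then reduces to evaluating this expression for each configuration. The following sketch shows the idea; the function names and all numeric values are hypothetical and are not measurements from this work.

```python
def average_access_time(t_cache, miss_ratio, miss_penalty):
    """Average memory access time: T = t_cache + m * T_M."""
    return t_cache + miss_ratio * miss_penalty


def best_mode(t_dm, m_dm, t_tw, m_tw, miss_penalty):
    """Return 'DM' or 'TW', whichever gives the smaller average access time."""
    dm = average_access_time(t_dm, m_dm, miss_penalty)
    tw = average_access_time(t_tw, m_tw, miss_penalty)
    return "TW" if tw < dm else "DM"


# Hypothetical faulty cache: DM is faster per access but misses more often.
mode_short_penalty = best_mode(5.0, 0.06, 5.5, 0.05, miss_penalty=25.0)
mode_long_penalty = best_mode(5.0, 0.06, 5.5, 0.05, miss_penalty=100.0)
```

With these illustrative numbers the DM mode wins for the short miss penalty and the TW mode wins for the long one.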

3.1 Miss Ratio Determination and Access Time Calculation 

Trace driven simulation is the best way for determining the miss ratio of a cache 
when no faults have occurred (Smith-1982). For our simulations we used the 
ATUM traces because they include both operating system references and 
multiprogramming effects and the way that these traces were gathered introduces 
only minor errors (Agarwal-1986). Due to the large number of traces, we present 
results only for one combined trace described as all in (Pour-1993). Table II in 
(Pour-1993) lists the number of instruction fetches, data reads and data writes for 
each individual trace used as well as a brief description of their origins. The all 
trace was formed by concatenating the individual traces with cache flushes inserted 
between them. Since each individual trace is only about 400000 references long, 
we simulate cache sizes up to 32 KB. Simulating larger caches would be impossible 
without introducing significant error (Pour-1993, Stone-1987). 

Two alternatives can be used for determining a cache’s miss ratio when some 
blocks have been disabled due to faults, namely trace driven simulation for each 
possible or for a number of the possible faulty combinations (Sohi-1989) and 
probabilistic theory based on least recently used distances (Pour-1993). The 
second alternative was chosen in this work since it provides accurate results and 
requires less time. According to this approach the mean miss ratios of caches with 
a number of disabled faulty blocks can be computed from the miss ratios of the 
non faulty caches and the occurrence probability of each faulty combination. 
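
For illustration only, the first alternative (trace driven simulation over a sample of faulty combinations) can be sketched as below. This is not the probabilistic method of (Pour-1993) that was used in this work; the function names, the tiny synthetic trace and the random sampling of fault locations are assumptions of the sketch.

```python
import random


def dm_miss_ratio(trace, num_blocks, block_bytes, disabled=frozenset()):
    """Miss ratio of a direct-mapped cache for a list of byte addresses.

    References that map to a disabled frame always miss.
    """
    frames = [None] * num_blocks
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        frame = block % num_blocks
        if frame in disabled:
            misses += 1                 # faulty frame: never holds data
        elif frames[frame] != block:
            misses += 1                 # ordinary miss, fill the frame
            frames[frame] = block
    return misses / len(trace)


def mean_faulty_miss_ratio(trace, num_blocks, block_bytes, faulty,
                           samples=20, seed=0):
    """Average miss ratio over randomly sampled sets of `faulty` disabled frames."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        disabled = frozenset(rng.sample(range(num_blocks), faulty))
        total += dm_miss_ratio(trace, num_blocks, block_bytes, disabled)
    return total / samples


# Tiny synthetic trace: two blocks referenced alternately.
trace = [0, 16, 0, 16]
fault_free = mean_faulty_miss_ratio(trace, 2, 16, faulty=0, samples=5)
one_fault = mean_faulty_miss_ratio(trace, 2, 16, faulty=1, samples=5)
```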

To calculate the cache access time, we used the time model presented in 
(Wilton-1994) as already mentioned. The delay of a 2->1 MUX in an implementation 
technology with minimum feature size of 0.8 micron was considered to be equal to 
0.35 ns. 

Table 2 Number of faulty blocks F after which the TW mode of operation offers 
shorter average access time. Block size = 16 Bytes. 

                     Cache Size (Bytes)
Miss penalty time    256    512    1K     2K     4K     8K     16K    32K
25                   3      5      10     19     39     55     142    280
50                   0      1      3      6      14     20     56     116
75                   0      0      0      3      7      9      31     68
100                  0      0      0      0      4      4      19     45
125                  0      0      0      0      0      0      12     31
150                  0      0      0      0      0      0      0      22
175                  0      0      0      0      0      0      0      16
200                  0      0      0      0      0      0      0      11



3.3 Results. 

Tables 2 and 3 give the value of F for cache sizes from 256 bytes up to 32 KB and 
for block sizes 16 and 32 bytes respectively. TM is varied from 25 up to 200 ns. 
A value of F equal to zero means that a reconfigurable cache operating in TW 
mode always offers better average access time. The conclusions that can be drawn 
from these tables are: 

1. For constant values of cache and block sizes, increasing TM decreases F. 

2. For constant values of block size and TM, increasing cache size increases F. 
This is because a larger cache contains more blocks than any smaller cache with 
the same block size. The hit ratio deterioration caused by faulty block disabling 
depends on the percentage of the disabled faulty blocks out of the total. If the 
same number of faulty blocks is disabled, in a smaller cache a greater 
percentage of the total cache capacity gets disabled and the hit ratio 
deterioration is larger. 

3. The dependence of F on the block size can be explained similarly. Since a 
larger block size for the same cache size means a smaller number of blocks, it 
is expected that the value of F will drop upon moving to larger block sizes, as 
verified by Tables 2 and 3. Disabling a larger block means that a greater 
percentage of the total cache capacity gets disabled. 

Table 3 Number of faulty blocks F after which the TW mode of operation offers 
shorter average access time. Block size = 32 Bytes. 

                     Cache Size (Bytes)
Miss penalty time    256    512    1K     2K     4K     8K     16K    32K
25                   2      3      6      12     22     45     73     190
50                   0      1      2      4      8      17     30     79
75                   0      0      0      2      5      9      17     48
100                                0      1      3      6      11     33
125                         0      0      0      2      4      7      24
150                         0      0      0      0      2      5      19
175                  0      0      0      0      0      0      3      14
200                         0      0      0      0      0      2
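
The way a value of F follows from the average access times of the two modes can be sketched as a simple search. In the sketch below the function name, the access times and the linear miss ratio models are all hypothetical; F is taken as the smallest number of disabled blocks at which the TW mode is strictly faster, so that F = 0 means the TW mode is always better.

```python
def min_faulty_blocks_for_tw(t_dm, t_tw, m_dm, m_tw, miss_penalty, max_faulty):
    """Smallest number of disabled blocks k for which TW mode is faster.

    m_dm and m_tw are callables returning the mean miss ratio for k disabled
    blocks. Returns None if DM stays at least as fast up to max_faulty.
    """
    for k in range(max_faulty + 1):
        t_avg_dm = t_dm + m_dm(k) * miss_penalty
        t_avg_tw = t_tw + m_tw(k) * miss_penalty
        if t_avg_tw < t_avg_dm:
            return k
    return None


# Hypothetical miss ratio models: DM degrades faster per disabled block.
def m_dm(k):
    return 0.05 + 0.002 * k


def m_tw(k):
    return 0.045 + 0.0005 * k


f_short = min_faulty_blocks_for_tw(5.0, 5.4, m_dm, m_tw, 25.0, 256)
f_long = min_faulty_blocks_for_tw(5.0, 5.4, m_dm, m_tw, 100.0, 256)
```

With these illustrative models the longer miss penalty gives the smaller F, in line with the first conclusion above.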

In (Vergos-1995 b) the performance recovery in DM faulty caches via the use of 
a very small fully associative spare cache was investigated. In many cases the 
reconfigurable cache design proposed in this paper offers better average access 
time than the method proposed in (Vergos-1995 b). 



4 CONCLUSIONS 

To achieve high performance in single chip VLSI processors, on-chip cache 
memories are used. With the increase of the chip area devoted to on-chip caches, 
a substantial portion of the manufacturing defects is expected to occur in the 
cache portion of the VLSI processor chip. If the cache defects are tolerated 
without a noticeable performance degradation, the yield of VLSI processors can be 
enhanced considerably. It has been shown that the performance degradation due to 
disabling the faulty blocks is small for SA caches, while in the case of DM 
caches it can be substantial. However, DM caches are the fastest and also offer 
smaller average access time than SA ones for sufficiently large sizes. Thus, as the 
size of the on-chip cache increases the use of direct-mapped caches is favoured. 

In this paper we have designed a reconfigurable cache capable of operating in 
either DM or TW mode. The proposed design allows VLSI processor chips with a 
partially good reconfigurable cache to be used in the operation mode (DM or TW) 
that minimises the average access time increase due to faulty block disabling. 
Since the use of chips with partially good reconfigurable caches implies a very 
small performance degradation, we believe that these chips can be accepted 
during production testing, leading to a significant yield enhancement. Apart from 
production testing, faulty cache block disabling can also take place during on 
field testing. Then, if the total number of disabled faulty blocks (caused by 
manufacturing defects and permanent operational faults) exceeds F, the operation 
mode of the reconfigurable cache can be switched from DM to TW. A significant 
feature of the reconfigurable cache is that its access time and average access 
time when operating in DM mode for cache sizes of practical interest 
(specifically sizes greater than 4 KB) are the same as those of the corresponding 
optimal DM cache. Therefore, VLSI processor chips with a fault free 
reconfigurable cache will operate as fast as chips with an optimal DM cache. 
Also, the area overhead of the reconfigurable cache with respect to the 
corresponding optimal DM cache was estimated to be very small, about 1.88%. In 
this paper we have also investigated the dependence on cache and block size and 
on miss penalty of the minimum number F of faulty cache blocks after which the TW 
operation mode of the cache offers a shorter average access time. 

Abstract 

A low-voltage operational transconductance amplifier (OTA) and a bandpass Gm-C 
filter built with the OTA are described in this paper. The OTA has effectively only 
two MOS transistors in the saturation region between VDD and GND, which suits 
low voltage operation, and there are no internal nodes in the OTA, which suits 
high frequency operation. With this OTA, a prototype bandpass Gm-C filter was 
implemented in a 0.8µm CMOS process. The center frequency can be controlled 
from 15.0MHz to 33.0MHz. The filter consumes only 0.43mW per pole when the 
center frequency is 15.0MHz. 

OTA, MOS transistor, low-voltage operation, low-power consumption, high- 

1 INTRODUCTION 



Gm-C filters have been widely used for high frequency applications because their 
open loop architecture provides high frequency capability (Stefanelli, 1993), 
(Khoury, 1991). However, as the supply voltage is scaled down for low-power 
consumption, high-frequency operation becomes difficult to achieve due to the 
rapid performance degradation of OTAs. 

MOS transistors in the linear operation region can be used in OTAs for 
low-voltage operation, but this generally requires bipolar transistors to keep the 
drain-to-source voltage of the MOS transistors constant (Rezzi, 1995). As an 
alternative, B. Nauta used a CMOS inverter as an OTA and achieved a 22.0MHz 
cut-off frequency with a 2.5V single power supply. This is due to the fact that 
CMOS inverters do not have internal nodes limiting the operation frequency. 
However, the supply voltage must be controlled to tune the frequency 
characteristics of the filter (Nauta, 1992). 

In this paper, a differential low-voltage OTA is described which is designed to be 
simple for low power and high frequency operation. The common mode feedback 
is achieved with no additional power consumption, aiding low power operation. 
With the OTA, a prototype bandpass Gm-C filter is implemented in a 0.8µm 
CMOS process whose center frequency can be controlled from 15.0MHz to 
33.0MHz with a 3.0V single power supply. 



2 LOW-VOLTAGE OPERATIONAL TRANSCONDUCTANCE 
AMPLIFIER 



In Figure 1, the low-voltage OTA is shown with its bias circuit. The transistors 
M1 to M4 are in the saturation region, and the transistors M5 and M6 are in the 
linear operation region. The basic voltage-to-current conversion is done by the 
transistors M1 and M2, and the transistors M3 and M4 act as the load. The common 
mode feedback is performed by the transistors M5 and M6. This common mode 
feedback loop requires no additional current consumption, aiding low power 
operation. Except for the source node of the transistors M3 and M4, there are no 
internal nodes in the OTA, so the filter built with this OTA can achieve 
high-frequency operation. And because there are only two transistors in the 
saturation region between VDD and GND, the OTA can operate at a low power 
supply voltage. 

Figure 1 (a) Low-voltage, low-power operational transconductance amplifier. M1, 
M2, M3 and M4 are in the saturation region, and M5 and M6, which constitute the 
common mode feedback loop, are in the linear operation region. (b) Replica bias 
circuit for the OTA generating Vb such that the output is CMREF when the input is 
CMREF. 

Figure 2 Linearity error as a function of differential input. 

The output current of the OTA is given as follows: 

$i_{out} = \beta_1 (V_{GS1} + V_{GS2} \ldots$ (1) 

where $\beta_1 = \mu_n C_{ox} (W/L)_1$. From equation (1), the transconductance of 
the OTA is: 

$g_m = \beta_1 (V_{GS1} + V_{GS2} \ldots$ (2) 

As can be seen in equation (2), the transconductance of the OTA is determined 
by the size of the transistor M1 and the common mode level of the input. Thus, in 
order to get a linearized transconductor, the common mode level of the input must 
be stabilized. Within a filter, this is achieved by the bias circuit shown in 
Figure 1-(b) and the common mode feedback loop consisting of the transistors M5 
and M6, assuming that the input to the filter has a stabilized common mode level. 

The replica bias circuit shown in Figure 1-(b) generates the bias voltage Vb such 
that when the input voltage is CMREF, the output voltage is also CMREF. The 
value of CMREF is VDD/2. 

Figure 3 Second-order biquad circuit. The variable capacitors are used to tune the 
frequency characteristic to the desired one when there are temperature variation 
and process drift. The variable capacitors are built with a binary weighted 
capacitor array. 



The size of each transistor of the replica bias circuit in Figure 1-(b) is the same 
as that of the same numbered transistor of the OTA in Figure 1-(a). 

Figure 2 shows the linearity error as a function of differential input. As can be 
seen, the error is smaller than 2.0% for -1.0V < v_in < 1.0V at DC. The linearity 
error is defined as: 



$\mathrm{error} = \dfrac{i_{out} - i_{out}(0) - g_m(0)\, v_{in}}{g_m(0)\, v_{in}} \times 100\ [\%]$ (3) 

where $v_{in}$ is the differential input, $i_{out}(0)$ and $g_m(0)$ are the output 
current and transconductance when $v_{in}$ = 0V, respectively, and $i_{out}$ is the 
output current. 



3 SIXTH-ORDER BANDPASS GM-C FILTER 



In order to verify the usefulness of the low-voltage, low-power OTA described in 
the previous section, a sixth-order bandpass Gm-C filter is implemented by 
cascading three biquadratic sections shown in Figure 3. 

Figure 4 Binary weighted capacitor array. The capacitance is controlled with the 
digital control signal D[2:0]. 



The center frequency of the filter is controlled by the variable load capacitors 
instead of the transconductance of the OTA. This is because the transconductance 
of the OTA cannot be changed once the size of the transistors and the common 
mode level of the input are fixed. Tunability of the frequency characteristic of 
a Gm-C filter is a must because the frequency characteristics of continuous-time 
filters can change by as much as ±50% due to process drift, temperature 
variation, aging, etc. (Yoo, 1996), (Gopinathan, 1990), (Khorramabadi, 1996). 

The variable capacitors are built with the binary-weighted capacitor array shown in 
Figure 4. The capacitance is controlled by the digital signal D[2:0]. The control 
range can be extended easily if desired. 
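
The tuning principle can be illustrated with a small numeric sketch. Everything in it is an assumption made for illustration: the function names, the unit and fixed capacitances, the transconductance value, the simplification that a biquad with equal transconductors and equal capacitors has a center frequency of gm/(2πC), and the polarity of the control code (each set bit of D adds capacitance here, which may be inverted on the actual chip).

```python
import math


def array_capacitance(code, c_unit, c_fixed=0.0, bits=3):
    """Capacitance of a binary weighted array for a digital code D[bits-1:0].

    Bit i switches in a capacitor of value 2**i * c_unit (assumed polarity).
    """
    if not 0 <= code < (1 << bits):
        raise ValueError("code out of range")
    switched = sum(((code >> i) & 1) * (1 << i) * c_unit for i in range(bits))
    return c_fixed + switched


def center_frequency(gm, c_load):
    """Center frequency in Hz of an idealised Gm-C biquad: f0 = gm / (2*pi*C)."""
    return gm / (2 * math.pi * c_load)


# Hypothetical values: 0.1 pF unit capacitor, 0.6 pF fixed load, gm = 125 uS.
c_min = array_capacitance(0b000, 0.1e-12, 0.6e-12)
c_max = array_capacitance(0b111, 0.1e-12, 0.6e-12)
f_high = center_frequency(125e-6, c_min)
f_low = center_frequency(125e-6, c_max)
```

In this idealised model the tuning range f_high/f_low equals c_max/c_min, so the ratio of the largest to the smallest load capacitance sets the achievable frequency range.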

Although the capacitance has to be controlled manually in the present 
implementation, it can also be controlled automatically by monitoring the 
frequency characteristic of the filter as in a conventional master-slave tuning 
scheme (Yoo, 1996), (Gopinathan, 1990). There are several works related to this 
issue, and they can be applied to this filter without difficulty (Khorramabadi, 
1996). 



4 EXPERIMENTAL RESULTS 



A prototype sixth-order bandpass Gm-C filter was implemented in a 0.8µm 
double-metal CMOS process, and occupies 1.0mm x 0.4mm of silicon area. 

Table 1 Performance comparison with other CMOS continuous-time filters. 

Reference      f [kHz]    N    Power [mW]    FM
               15,000     5    96            2.89
               0.945      3    0.0126        2.35
Huang, 1995    560        2    2.5           2.65
This work      15,000     6    2.58          4.54


The center frequency can be controlled from 15.0MHz to 33.0MHz by varying the 
size of the capacitors with the digital control signal D[2:0]. The filter consumes 
only 2.6mW with a 3.0V single power supply. A peak-to-peak differential input of 
up to 0.85V can be applied for 1.0% total harmonic distortion when the center 
frequency is 15.0MHz. 

The frequency characteristic of the filter is shown in Figure 5. The passband gain 
appears to become smaller as the center frequency gets higher because the output 
buffer and test set-ups have a lowpass characteristic. But the filter itself has 
constant passband gain regardless of its center frequency. 

The transient response of the filter is measured and the spectrum of the filtered 
output is observed with the center frequency set to 15.0MHz. 

When an input consisting of a 1.0MHz sinusoid (in the stopband) added to a 
15.0MHz sinusoid (in the passband) is applied, the filtered output is shown in the 
bottom trace of Figure 6-(a) with the input in the top trace. Figure 6-(b) is the 
spectrum of the filtered output. Only the 15.0MHz component in the passband can 
be seen. 

Figure 7-(a) and (b) are the waveforms of input and output and the spectrum of the 
output respectively with an input of a 14.0MHz sinusoid added to a 16.0MHz 
sinusoid (both of them are in the passband). Both the 14.0MHz and 16.0MHz 
components can be seen in the filtered output. 

The power consumption of the filter is compared with that of filters presented 
elsewhere based on the figure of merit defined as: 

$FM = \log\left(\dfrac{f_{cut-off}}{Power / N}\right)$ (4) 

where $f_{cut-off}$ is the center or cut-off frequency of the filter in kHz and 
Power/N is the power per pole in mW. The figures of merit of several works are 
summarized in Table 1. We can see that the prototype filter has the advantage of 
low power consumption. 
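
The figure of merit is easy to evaluate numerically. The sketch below assumes a base-10 logarithm, which reproduces the FM values listed in Table 1; the function name `figure_of_merit` is hypothetical. The value for this work is computed from the 15.0MHz center frequency, the six poles and the 2.58mW power reported here.

```python
import math


def figure_of_merit(f_khz, power_mw, poles):
    """FM = log10(f / (Power / N)), with f in kHz and Power in mW."""
    return math.log10(f_khz / (power_mw / poles))


# (frequency in kHz, power in mW, number of poles) for the rows of Table 1.
fm_row1 = figure_of_merit(15000, 96, 5)
fm_row2 = figure_of_merit(0.945, 0.0126, 3)
fm_huang = figure_of_merit(560, 2.5, 2)
fm_this_work = figure_of_merit(15000, 2.58, 6)
```

Because FM is logarithmic, the difference of roughly 1.6 between this work and the best of the other entries corresponds to a frequency-per-power-per-pole ratio about 40 times higher.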
Figure 5 Frequency characteristic of sixth-order bandpass Gm-C filter. (a) When 
D[2:0] is 000-011. (b) When D[2:0] is 100-111. Center frequency can be 
controlled from 15.0MHz to 33.0MHz. The passband gain gets smaller as the 
center frequency becomes larger due to the lowpass characteristic of test set-ups 
and output buffer. 

Figure 6 (a) Transient response when D[2:0] = 000, that is, cut-off frequency = 
15.0MHz. Top trace = input, sum of 1.0MHz and 15.0MHz sinusoids. Bottom trace 
= filtered output. (b) Spectrum of the output in (a). It can be seen that the 
1.0MHz component is attenuated. 

Figure 7 (a) Transient response when D[2:0] = 000, that is, cut-off frequency = 
15.0MHz. Top trace = input, sum of 14.0MHz and 16.0MHz sinusoids. Bottom 
trace = filtered output. (b) Spectrum of the output in (a). Both the 14.0MHz and 
16.0MHz components can be seen. 

5 CONCLUSION 



A low-voltage OTA has been described and applied to a bandpass Gm-C filter. The 
OTA achieves high-frequency operation with a lowered power supply. A 
prototype sixth-order bandpass filter was implemented in a 0.8µm CMOS process, 
and its center frequency can be controlled from 15.0MHz to 33.0MHz. The filter 
consumes 0.43mW per pole from a single 3.0V power supply. 
Abstract 

A second generation integrator based on the switching current memory cell 
reported in (Gonçalves, 1996 A) has been prototyped. The constant voltage 
switching of the integrator is well suited to low voltage applications, since it 
avoids the conduction gap of the switches as well as the signal dependent charge 
injection. A programmable biquad has been implemented using the proposed 
second generation integrator. The center frequency and the quality factor can be 
tuned independently. 
Switched current circuits and Digital programmable filter. 

1 INTRODUCTION 

Sampled data circuits have been intensively employed in VLSI chips. The 
switched capacitor (SC) technique has been the prevailing one over the last two 
decades. SC filters achieve a high accuracy with a low distortion. However, 
besides requiring a double poly process, the standard SC technique has the 
problem of increasing prohibitively the resistance of the switches for low voltage 
operation (Crols, 1994). If the supply voltage is lower than a certain minimum 
value, the switch resistance tends towards infinity (Crols, 1994 and Vittoz, 1994) 
for a range of the input level. This “conduction gap” is a critical limitation for SC 
filters. There are some special techniques to deal with this problem, such as the 
use of dedicated processes, on-chip generation of a voltage larger than the power 
supply and the switched op-amp (Crols, 1994). Of course, these techniques add 
some extra cost to the chip. 

In the late 1980s a new sampled data technique called switched current (SI) was 
introduced (Hughes, 1989 and 1990). The basic SI circuit is the current mode 
track and hold circuit shown in Figure 1(a). This technique presents the same 
limitation of SC circuits with respect to the conduction gap of the switches when 
operated at low supply voltages. To overcome the problem of the conduction gap 
of the switches, the SI mirror scheme shown in Figure 1(b) was presented in 
(Gonçalves, 1996 A). 

In this work we propose a second generation SI integrator. In the new integrator 
the switches operate at constant voltage, thus avoiding the conduction gap 
existing in conventional SI circuits. Moreover, the charge injected by the switches 
becomes signal-independent. The proposed integrator has been prototyped and 
programmed by using MOSFET-Only Current Dividers (MOCD) (Bult, 1992 and 
Gonçalves, 1996 A). A programmable integrator-based biquad which allows 
independent tuning of the center frequency and the quality factor has been 
implemented. 





Figure 1 (a) Conventional SI mirror. 

(b) SI mirror proposed in (Gonçalves, 1996 A). 
2 SECOND GENERATION (SI) INTEGRATOR 



The conventional second generation SI integrator (Hughes, 1989) is shown in 
Figure 2(a). In this paper, we propose a second generation SI integrator based on 
the SI mirror shown in Figure 1(b). The integrator is made up of two switched 
current memory cells, as shown in Figure 2(b). The two available outputs are $I_{OA}$ and 
$I_{OB}$. 

In the odd phase: 

$$I_{OA}(n) = \alpha I_A(n) = -\alpha\{I_{in}(n) + I_B(n)\} \qquad (1.a)$$

while in the even phase: 

$$I_B(n) = I_B(n - 1/2) = -I_A(n-1) \qquad (1.b)$$




Figure 2 (a) Conventional second generation SI integrator. 

(b) Proposed second generation SI integrator. 

(c) Switching clock sequence, with sampling instants (n-1), (n-1/2) and (n). 
From equations (1.a) and (1.b) we can write 

$$I_{OA}(n) = -\alpha I_{in}(n) + I_{OA}(n-1) \qquad (2.a)$$

and 

$$I_{OB}(n) = \gamma I_{in}(n-1) + I_{OB}(n-1) \qquad (2.b)$$

where $\alpha = (W/L)_{Ma}/(W/L)_{M}$ and $\gamma = (W/L)_{Mb}/(W/L)_{M}$. 

W and L are the width and length of the channel, respectively. 
The z-transformation of (2.a) and (2.b) gives: 

$$\frac{I_{OA}^{o}}{I_{IN}^{o}} = -\alpha \frac{1}{1 - z^{-1}} \qquad (3.a)$$

$$\frac{I_{OB}^{o}}{I_{IN}^{o}} = \gamma \frac{z^{-1}}{1 - z^{-1}} \qquad (3.b)$$
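
The behaviour described by the difference equations (2.a) and (2.b) can be illustrated with a minimal sketch that iterates them directly. The function name, the zero initial state, the unit-step input and the coefficient values are assumptions chosen for illustration only.

```python
def simulate_integrator(i_in, alpha=1.0, gamma=1.0):
    """Iterate the difference equations (2.a) and (2.b) from a zero initial state."""
    i_oa, i_ob = [], []
    prev_oa = prev_ob = prev_in = 0.0
    for x in i_in:
        oa = -alpha * x + prev_oa        # (2.a): inverting, no delay on the input
        ob = gamma * prev_in + prev_ob   # (2.b): non-inverting, one-sample delay
        i_oa.append(oa)
        i_ob.append(ob)
        prev_oa, prev_ob, prev_in = oa, ob, x
    return i_oa, i_ob


# Unit-step input with illustrative mirror ratios alpha = gamma = 0.5.
step_oa, step_ob = simulate_integrator([1.0] * 5, alpha=0.5, gamma=0.5)
print(step_oa)
print(step_ob)
```

Both outputs ramp for a constant input, as expected of an integrator; output B is the delayed, non-inverted counterpart of output A, which is what the extra factor of z^-1 in (3.b) expresses.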



The timing of the clock waveforms is shown in Figure 2 (c). This switching 
sequence is necessary to avoid the loss of information during clock transition in a 
practical implementation. 

A lossy SI integrator can be realized using the dotted feedback path shown in 
Figure 2 (b). The z-domain transfer functions are: 

$$\frac{I_{OA}^{o}}{I_{IN}^{o}} = -\alpha \frac{1}{1 + \beta - z^{-1}} \qquad (4.a)$$

$$\frac{I_{OB}^{o}}{I_{IN}^{o}} = \gamma \frac{z^{-1}}{1 + \beta - z^{-1}} \qquad (4.b)$$

where $\beta = (W/L)_{Mc}/(W/L)_{M}$. 



The proposed second generation SI integrator has the same sensitivities to 
transistor mismatches as the conventional SI integrator (Hughes, 1989). 
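
To give a feel for the lossy transfer functions (4.a) and (4.b), the following sketch evaluates them on the unit circle. The function name and the sampling frequency are hypothetical; the loss factor 21/64 is the value used in the experiment reported in the next section.

```python
import cmath
import math


def lossy_integrator_response(f_hz, fs_hz, alpha, beta, gamma):
    """Evaluate (4.a) and (4.b) at z = exp(j*2*pi*f/fs)."""
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    den = 1.0 + beta - z_inv
    h_a = -alpha / den          # (4.a)
    h_b = gamma * z_inv / den   # (4.b)
    return h_a, h_b


FS = 100e3            # hypothetical clock frequency, not taken from the paper
BETA = 21.0 / 64.0    # loss factor of the experimental integrator

h_a_dc, h_b_dc = lossy_integrator_response(0.0, FS, 1.0, BETA, 1.0)
h_a_ny, _ = lossy_integrator_response(FS / 2.0, FS, 1.0, BETA, 1.0)
print(abs(h_a_dc), abs(h_a_ny))
```

At DC the denominator reduces to beta, so the gain magnitude is alpha/beta; at half the clock frequency it falls to alpha/(2 + beta), which is the first-order lowpass behaviour of a damped integrator.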



3 EXPERIMENTAL RESULTS 



The SI lossy integrator shown in Figure 2 (b) was implemented using TL 082 
operational amplifiers, integrated MOS transistors (W = 48 µm, L = 1.2 µm), CD 4007 
MOS switches and holding capacitors of 1.8nF. The loss factor (β) was set by 
a 6-bit MOSFET-Only Current Divider (MOCD) (Bult, 1992) whose scheme is 
shown in Figure 3 (a). The 6-bit MOCD was integrated on a Sea of Transistors 
(SoT) array, in a technology from ES2 (Gonçalves, 1996 B). The MOCD 
layout is shown in Figure 3 (b). 




Figure 3 (a) Switched MOCD and its symbol. 

(b) MOCD layout on an SoT array. 

The MOCD input impedance is independent of both the digital word and the 
clock phase, thus providing a constant load impedance to the op amps. The 
MOCD is switched by ANDing the digital word with the even phase waveform, 
$\{\phi_e \cdot b\,(b_{N-1} \ldots b_0)\}$. The experimental time response of the integrator is shown in 
Figure 4 (a) (β = 21/64) for a 1.196kHz input signal. Note that the output signal 
does not present glitches. This important property is due to the constant voltage 
switching of the memory cell. The integrator has been simulated using the ASIZ 
program (Queiroz, 1993). The simulated and experimental frequency responses 
presented in Figure 4 (b) show excellent agreement. 
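
The relation between the digital word and the loss factor can be sketched as follows, assuming the MOCD performs a binary-weighted division so that an N-bit word k sets the ratio k/2^N. This weighting and the function name are assumptions made for illustration; they are consistent with the value β = 21/64 quoted for the 6-bit divider.

```python
def mocd_fraction(bits):
    """Return word / 2**N for a bit sequence ordered from b_{N-1} (MSB) to b_0 (LSB).

    Assumes a binary-weighted current division.
    """
    word = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("each bit must be 0 or 1")
        word = (word << 1) | b
    return word / (1 << len(bits))


# The 6-bit word 010101 (decimal 21) gives the loss factor used in Figure 4.
beta_example = mocd_fraction([0, 1, 0, 1, 0, 1])
print(beta_example)
```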
4 SECOND ORDER SECTION 

As an application of the second generation SI integrator we built a biquadratic 
section, designed using the backward LDI transformation (Silva-Martinez, 1989). 
This transformation leads to smaller frequency prewarping errors than the Euler 
transformations. The biquad circuit is presented in Figure 5. The center frequency 
$\omega_o$ and the quality factor Q can be controlled independently if the sampling 
frequency is much higher than the center frequency. In this case: 



$$\omega_o T \cong \alpha \qquad (5.a)$$


Q =l/f 


(5.b) 



A discrete prototype of the filter has been implemented and tested. In 
this experimental work, the transistors were replaced by resistors. The 
programmability of the filter was obtained by scaling the resistances. The unit 
resistor is 20 kΩ, and the holding capacitors are C = 100pF. The bandpass filter 
has been programmed for center frequencies $f_o$ = 150, 300 and 600Hz. The 
sampling frequency was 15kHz and the quality factor was equal to 8. The 
simulation and experimental results are shown in Figure 6. For very low 
$\omega_o T$, the error in the center frequency is large due to the variability of the 
resistors. To reduce this error we have to decrease the variability of the resistances 
or, for an IC implementation, increase the resolution of the MOCD. 
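
The sensitivity at low normalized center frequency can be made concrete with a small sketch. It computes ω_o T = 2π f_o / f_s for the three programmed frequencies and, assuming the tuning coefficient is rounded to the nearest step of an N-bit divider, the worst-case relative rounding error. The assumption that the coefficient is proportional to ω_o T, and the function names, are illustrative and not taken from the paper.

```python
import math


def normalized_center_frequency(f0_hz, fs_hz):
    """Return omega_o * T = 2*pi*f0/fs."""
    return 2.0 * math.pi * f0_hz / fs_hz


def worst_case_rounding_error(coefficient, n_bits):
    """Relative error when a coefficient is rounded to the nearest multiple of 2**-n_bits."""
    half_lsb = 2.0 ** -n_bits / 2.0
    return half_lsb / coefficient


FS_HZ = 15e3
for f0 in (150.0, 300.0, 600.0):
    w0t = normalized_center_frequency(f0, FS_HZ)
    err6 = worst_case_rounding_error(w0t, 6)
    err8 = worst_case_rounding_error(w0t, 8)
    print(f"f0 = {f0:5.0f} Hz  w0T = {w0t:.4f}  6-bit: {err6:.1%}  8-bit: {err8:.1%}")
```

The same absolute step (or the same absolute resistor spread) is a larger fraction of a small coefficient, which is why the lowest center frequency shows the largest relative error and why a finer divider helps.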




Figure 5 - Biquadratic section using second-generation SI integrator. 

The DC output caused by the op amp offsets is 

( 6 ) 

where $G_{dc}$ is the low frequency gain ($K_{LP}/\alpha$), 
$V_{off1}$ is the offset of op amp $A_{11}$ and 

$\Delta V$ is the offset mismatch $\{V_{off1}(A_{11}) - V_{off2}(A_{12})\}$. 

The DC component of the output, as given by (6), is not large provided that the 
sampling frequency is not very much higher than the center frequency of the 
biquad or the offset mismatch is not high. Dynamic techniques (Vittoz, 1994) can 
also be employed to reduce the effects of the offset mismatch. 




Figure 6 Magnitude responses of the bandpass filter versus frequency f (Hz), 
theoretical (dotted line) and experimental (solid line), for 
$f_o$ = 150, 300 and 600 Hz (Q = 8 and $f_s$ = 15kHz). 



5 CONCLUSIONS 

A second generation SI integrator has been reported in this work. The main 
advantage of the integrator presented here, when compared to the conventional 
one, is its applicability in low voltage circuits. Moreover, the constant voltage 
switching provides the circuit with a signal-independent charge injection. The 
programmability of the SI integrator has been tested using a digitally 
programmable MOCD. A bandpass SI filter has been implemented with discrete 
elements and tested at different center frequencies. 
Abstract 

A novel current-mode analogue continuous-time filter synthesis technique is 
proposed based on a transformed set of state-equations, that allow the utilization of 
only current-mode linear lossy-integrators implemented using standard current 
mirrors; also a novel current-mode gyrator is introduced. The technique has been 
successfully used for synthesizing low-voltage 5th order lowpass, highpass, 
bandpass and stopband Chebyshev filters from passive doubly loaded ladder 
prototypes. SPICE simulations using CMOS standard process level 2 parameters 
have shown these filters perform well, exhibiting a THD less than 1% at 100 kHz 
for a 40 mV output (1 µA input current) while the bias current per transistor is 10 
µA. Also, a Monte Carlo analysis has shown these filters preserve the low 
sensitivity inherent to the passive ladder prototype. The reduced number of small 
transistors necessary leads to small integration areas. The circuits have been fed 
with ±1.5 V power supplies. 



INTRODUCTION 

Analogue continuous-time filters can be synthesized from passive doubly loaded 
ladder networks, preserving their low sensitivity to component variations. For 
integrated circuit implementations, inductors and resistors can be replaced by using 
active simulation techniques based on voltage-controlled current-sources (VCCS); 
operational transconductance amplifiers (OTA) have been successfully used as 
linear VCCS, originating the so-called OTA-C filters (Queiroz, 1988; Queiroz, 
1989), widely used in high-frequency applications. 

Analogue continuous-time current-mode filters are commonly synthesized from 
passive models applying the same techniques and the same state-equations used in 
synthesis of OTA-C filters. In this way, it is always necessary to use current-mode 
lossless-integrators made by feeding current into a linear capacitor and then 
converting the voltage across the integrating capacitor to current, usually through a 
non-linear operator performed by a single MOSFET (Smith, 1996; Masry, 1996). 

In this paper, a novel current-mode filter synthesis procedure is proposed based on 
a transformed set of state-equations that allow the utilization of only current-mode 
linear lossy-integrators, based on current-controlled current-sources (CCCS); 
neither current to voltage nor voltage to current non-linear conversions are 
necessary (Galvez-Durand, 1996). 

The resulting filter topologies allow the utilization of low-voltage power supplies 
and can be interesting for those applications where the power consumption must be 
kept low. Lowpass, highpass, bandpass and stopband filters can be synthesized 
using this technique, exhibiting a very good performance despite the reduced 
number of small transistors necessary for these implementations. 



MOSFET-C INTEGRATORS 

The MOS transistor can be used as building block for current-mode VCCS-based 
lossless-integrators as well as for CCCS-based lossy-integrators. The first one can 
be commonly found in analogue current-mode continuous-time filter 
implementations as a consequence of the synthesis methods applied so far. 

For the sake of simplicity, all n-MOS transistors working as bias current-sources 
are omitted in all figures shown in this paper. Also, the non-linear second order 
effect introduced by the MOSFET output channel-conductance is neglected in all 
the algebraic analysis (but it will be taken into account in all simulations presented 
hereafter). 

A current-mode VCCS-based lossless-integrator is usually implemented by a 
non-linear function f(·) performed by a single MOSFET. A p-MOS implementation of 
such a current-mode non-linear lossless-integrator is depicted in figure 1. 




Figure 1 Current-mode non-linear lossless-integrator 

A linear CCCS-based lossy-integrator can be implemented using standard current 
mirrors, figure 2. The cutoff frequency of this integrator is given by $g_m/C$, where 
$g_m$ is the input MOSFET transconductance and C is the integrating capacitor. 



Figure 2 Current-mode linear lossy-integrator 
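
As a small worked example of the cutoff relation, the sketch below converts between the integrating capacitor and the cutoff frequency. Since g_m/C is an angular frequency, the value in hertz is g_m/(2πC). The transconductance value is purely hypothetical and is not taken from the paper.

```python
import math


def cutoff_frequency_hz(gm_siemens, c_farads):
    """Cutoff of the lossy integrator in Hz: (gm / C) / (2*pi)."""
    return gm_siemens / (2.0 * math.pi * c_farads)


def capacitor_for_cutoff(gm_siemens, fc_hz):
    """Integrating capacitor that places the cutoff at fc_hz."""
    return gm_siemens / (2.0 * math.pi * fc_hz)


GM = 50e-6  # hypothetical transconductance of 50 uS
c_100k = capacitor_for_cutoff(GM, 100e3)
print(f"C = {c_100k * 1e12:.1f} pF for a 100 kHz cutoff")
```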

For these two integrators, the most important MOSFET parasitic component is the 
gate-to-source capacitor; the channel drain-to-source conductance is also important 
but its effect can be minimized by using more complex MOSFET structures 
(Galvez-Durand, 1994). 

For the sake of simplicity, the MOSFET's transconductance is 
normalized to one ($g_m$ = 1) in all the analyses performed here. 

SYNTHESIS 

The synthesis technique proposed in this paper is based on a modified set of 
state-equations for a doubly loaded ladder network; these equations can 
be implemented using the two MOSFET-based building blocks shown in figure 3. 




Figure 3 Building blocks: a) current-mode inverter; b) linear lossy-integrator 

The first one is a current inverter performed by a single current mirror, figure 3.a. 
The second one is a current-mode lossy-integrator performed by a single current 
mirror and a linear capacitor C attached to its common-gate node, figure 3.b. 

The proposed synthesis technique can be introduced using the lowpass case as 
example. 
Lowpass 

The passive ladder network in figure 4 has been obtained from a generic 5th order 
lowpass approximation without finite zeros. 




Figure 4 A 5th order lowpass passive doubly loaded ladder network without finite 
zeros 

The state-equations for the prototype in figure 4 are usually written out taking as 
state variables all capacitor voltages and inductor currents. In OTA-C 
implementations, all inductor currents are converted to voltages using gyrators; 
hence, in those implementations all state-variables are voltages and every integration 
can be performed using linear OTA-C lossless-integrators. 





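$$
\begin{aligned}
sC_1 V_1 &= V_{in} - V_1 - I_2 &\quad& (a)\\
sL_2 I_2 &= V_1 - V_3 && (b)\\
sC_3 V_3 &= I_2 - I_4 && (c)\\
sL_4 I_4 &= V_3 - V_5 && (d)\\
sC_5 V_5 &= I_4 - V_5 && (e)
\end{aligned}
\qquad (1)
$$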

In this way, a lossless integration always implies performing a voltage to current 
and then a current to voltage conversion. However, all lossless-integrators in eq. 1 
can be converted to lossy-integrators by adding dummy terms to both sides of eq. 
(1.b), (1.c) and (1.d); the integrators in eq. (1.a) and (1.e), corresponding to 
state-capacitors $C_1$ and $C_5$, are already lossy due to the passive network resistive loads. 
Then, eq. 1 can be rewritten in the lossy-integrator form of eq. 2. 
$$
\begin{aligned}
(sC_1 + 1) V_1 &= V_{in} - I_2 &\quad& (a)\\
(sL_2 + 1) I_2 &= V_1 - V_3 + I_2 && (b)\\
(sC_3 + 1) V_3 &= I_2 - I_4 + V_3 && (c)\\
(sL_4 + 1) I_4 &= V_3 - V_5 + I_4 && (d)\\
(sC_5 + 1) V_5 &= I_4 && (e)
\end{aligned}
\qquad (2)
$$



When all state-variables are realized as currents instead of voltages, eq. 2 can be 
implemented using the building-blocks in figure 3. 

The block in figure 5 is a novel structure for a current-mode gyrator, capable of 
generating the state variables representing the lowpass passive ladder prototype 
inductor currents. This circuit uses: the current-mode lossy-integrator in figure 3.b 
loaded with a capacitor of the same value as the inductor's, the current inverter in 
figure 3.a and the state-variables $V_{j-1}$ and $V_{j+1}$ (now represented by currents), 
proportional to the voltages across the inductor $L_j$ in the passive prototype. This block can 
implement eq. (2.b) and (2.d). 







Figure 5 Current-mode gyrator for lowpass filters 

The current inverter (figure 3.a) used in the current-mode gyrator in figure 5 
introduces a parasitic lossy-integration due to the gate-to-source capacitors of all 
MOSFETs in the current inverter structure. Fortunately, lowpass filter prototypes (of 
any odd order) synthesized using the technique proposed here do not need 
current-mode gyrators with more than two output currents. Hence, the effect of this 
parasitic lossy-integration on the current-mode gyrator performance can be 
neglected. 







Generalized technique 

The technique used in the lowpass synthesis can be extended to highpass, bandpass 
and stopband filters. These filters can be generated by applying frequency 
transformations to the modified lowpass state-equations (eq. 2). 

Eq. 2 can be written out in the generic form shown in eq. 3, where the 
capacitive susceptances $sC_j$ have been replaced with generic admittances $Y_j$. 



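$$
\begin{aligned}
(Y_1 + 1) V_1 &= V_{in} - I_2 &\quad& (a)\\
(Y_2 + 1) I_2 &= V_1 - V_3 + I_2 && (b)\\
(Y_3 + 1) V_3 &= I_2 - I_4 + V_3 && (c)\\
(Y_4 + 1) I_4 &= V_3 - V_5 + I_4 && (d)\\
(Y_5 + 1) V_5 &= I_4 && (e)
\end{aligned}
\qquad (3)
$$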



The frequency transformations applied to eq. 3 for the synthesis of highpass, 
bandpass and stopband filters modify the admittances $Y_j$. The generalization of the 
synthesis technique introduced in section 3.1 is based on building-blocks capable of 
implementing the generic terms $Y_j + 1$ for every lossy-integrator necessary in the 
synthesis of a given filter. In the simplest case, corresponding to the lowpass 
synthesis, the admittance $Y_j$ simplifies to the susceptance $sC_j$. 

Also, the lowpass current-mode gyrator in figure 5 can be defined as a generic 
integrator (depending on Yj +1) and a unit-gain current-mode amplifier, figure 6. 




Figure 6 Generic current-mode gyrator 

In this way, the synthesis of any filter reduces to designing an adequate integrator. 
Figure 7 summarizes the frequency transformations, the necessary admittances 
$Y_j$ and the corresponding integrator building-blocks for highpass, bandpass and 
stopband filters. 
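
For orientation, the sketch below evaluates the admittance Y_j obtained when the usual textbook frequency transformations are applied to a normalized lowpass element X_j (so that Y_j = p X_j, with p the transformed frequency variable). These are the standard transformations and are assumed here; they are not read from figure 7, and the function and parameter names are hypothetical.

```python
def transformed_admittance(kind, s, x_j, w0=1.0, bw=1.0):
    """Return Y_j = p * X_j, where p replaces s in the lowpass prototype.

    kind: 'lowpass', 'highpass', 'bandpass' or 'stopband'
    w0:   cutoff or center frequency (rad/s); bw: bandwidth (rad/s)
    """
    if kind == "lowpass":
        p = s
    elif kind == "highpass":
        p = w0 / s
    elif kind == "bandpass":
        p = (s * s + w0 * w0) / (bw * s)
    elif kind == "stopband":
        p = (bw * s) / (s * s + w0 * w0)
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    return p * x_j


# At the center frequency the bandpass variable maps to DC of the prototype.
y_bp_center = transformed_admittance("bandpass", 1j * 1.0, 2.0, w0=1.0, bw=0.1)
y_hp_edge = transformed_admittance("highpass", 1j * 1.0, 2.0, w0=1.0)
print(abs(y_bp_center), abs(y_hp_edge))
```

Each Y_j + 1 term then defines the integrator block needed for that filter type, in the same way that sC_j + 1 defines the lossy integrator of the lowpass case.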





Figure 7 Current-mode integrators for a) highpass, b) bandpass and c) stopband 
filters 
EXAMPLES 

Lowpass and bandpass 5th order 100 kHz Chebyshev filters with 1 dB passband ripple 
were synthesized using the current-mode technique proposed here. The results were 
obtained using the SPICE level 2 MOSFET model and parameters for a double metal, 
double poly, n-well, 1.2 µm standard CMOS process (threshold voltage n-MOS = 
+0.68 V, p-MOS = -0.69 V). Both circuits were fed with ±1.5 V. 

A Monte Carlo analysis was performed varying all transistor dimensions in 100 
trials, according to a Gaussian distribution with a 2% standard deviation. The mean of 
the filter responses, computed from the 100 Monte Carlo trials in the same figure, 
exhibits a good performance, preserving the low sensitivity of the passive prototype. 
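
The procedure can be sketched on a single current mirror rather than on the full filter. Each trial perturbs every W and L with a 2% Gaussian spread and records the resulting mirror gain; the mean and spread are then computed over 100 trials. The nominal dimensions, the seed and the function names are hypothetical, and the paper's analysis was run on the complete circuits in SPICE, not with this simplified model.

```python
import random
import statistics


def mirror_gain(w_in, l_in, w_out, l_out):
    """Ideal current-mirror gain: ratio of the (W/L) aspect ratios."""
    return (w_out / l_out) / (w_in / l_in)


def monte_carlo_mirror(trials=100, sigma=0.02, seed=0, w=10.0, l=2.0):
    """Perturb all four dimensions with a Gaussian of relative std `sigma`."""
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        w_in, l_in, w_out, l_out = (d * rng.gauss(1.0, sigma) for d in (w, l, w, l))
        gains.append(mirror_gain(w_in, l_in, w_out, l_out))
    return statistics.mean(gains), statistics.stdev(gains)


mean_gain, std_gain = monte_carlo_mirror()
print(f"mean gain = {mean_gain:.4f}, std = {std_gain:.4f}")
```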

The results obtained when scaling these filters to lower frequencies compare 
favourably with the results shown here; at lower frequencies transistors with 
longer channels can be used, reducing the MOSFET output conductance effect. 

A differential output buffer, figure 8, is necessary to allow experimental 
measurements on these prototypes (Galvez-Durand, 1996). The THD after the 
output buffer (taken from the 100 Monte Carlo trials mentioned before) is about 1% 
for a 40mV output signal. 

For large output signals the THD after the buffer is lower than before it, because the even order 
harmonics are canceled. For small output signals the THD at both nodes is almost 
the same and the relation between the input current and the output voltage can be 
considered linear ($I_{in}$ < 2 µA). 



Figure 8 a) Differential output buffer schematic. b) The THD before and after 
the buffer, taken from 100 Monte Carlo trials; the upper 
horizontal scale shows the input current (in µA) corresponding to the output voltage 
(in mV) in the lower horizontal scale. 

Fortunately, the signal to noise ratio after the buffer is about 26 for a 40 mV signal, making 
it possible to perform clear measurements in the region where the suppression of even 
order harmonics introduced by the differential output buffer can be neglected. 
Lowpass filter 

The lowpass filter power consumption is about 660 µW; only 22 signal-driving pMOS 
transistors were used. All pMOS transistors were biased with 10 µA currents. 





Figure 9 Fifth order 100 kHz lowpass Chebyshev filter: a) schematic; b) mean of 
the filter transfer, taken from c) the 100 Monte Carlo trials. 

The THD for these filters (assuming perfect matching for all MOSFETs) is less than 
0.7% for a 40mV output signal (1 µA input current), computed at 100 kHz without 
output buffer, that is, about 50% of the THD obtained using Monte Carlo analysis for 
simulating MOSFET mismatches. 
Bandpass Filter 

The bandpass filter in figure 10 has a power consumption of 1590 µW, using 53 
signal-driving pMOS transistors. Again, all pMOS transistors were biased with 10 µA currents. 
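
The two reported power figures are consistent with a simple static estimate in which each signal-driving transistor draws its 10 µA bias current from the full 3 V (±1.5 V) supply. The sketch below reproduces the arithmetic; the function name and the assumption that no other branch contributes are illustrative.

```python
def static_power_uw(n_transistors, bias_ua, supply_v):
    """Static power in microwatts: N branches, each carrying bias_ua from supply_v."""
    return n_transistors * bias_ua * supply_v


lowpass_uw = static_power_uw(22, 10.0, 3.0)
bandpass_uw = static_power_uw(53, 10.0, 3.0)
print(lowpass_uw, bandpass_uw)
```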




Figure 10 Fifth order 100 kHz bandpass Chebyshev filter: a) schematic; b) mean of 
the filter transfer, taken from c) the 100 Monte Carlo trials. 
The labels inside the picture have been simplified because of size limitations. 

The differential output buffer in figure 8 has been used for this filter too, with 
similar performance. The filter dynamic range has been equalized only for the 3 
states representing the passive prototype inductor currents. 



CONCLUSIONS 

The current-mode filters generated using the proposed technique are capable of 
working from a 3V low-voltage power supply, exhibiting low power 
consumption and low harmonic distortion (THD). 

The signal to noise ratio computed after the differential output buffer allows clear 
measurements on these prototypes. 

A novel generic structure for current-mode gyrators has been presented here; this 
structure makes possible the current-mode simulation of inductors without using 
either non-linear current to voltage or voltage to current conversions. 

Also, this structure allows filter implementations of orders higher than 5th without 
increasing the feedback-between-states complexity. 

The results shown here can be applied to even order approximations too, using 
Moebius transformations for obtaining adequate passive doubly loaded ladder 
prototypes. 

The Monte Carlo analysis has shown the reliability of the lowpass and bandpass filters 
even when using a standard CMOS integration process. The 2% Gaussian distribution 
used in this analysis covers errors due to precision in poly and diffusion rectangles 
and relates to filter sensitivity (to small flaws), usually computed by SPICE 
(and other simulators) by linearizing the circuit around a bias point and then 
performing sensitivity analysis using the MOSFET small-signal model. The Monte 
Carlo method improves this kind of analysis by forcing SPICE to generate a new 
bias point for every modified circuit (hence, new small-signal components); the 
raw data generated are statistically post-processed to obtain a mean for the 
desired circuit transfer. 

For process and temperature effects (large variations) the filter must be simulated 
including a master bias-control, responsible for compensating for these effects. A 
master bias-control shall be designed after obtaining experimental data for these 
prototypes. 

The reduced number of small transistors needed implies a reduced active 
fabrication area. 

The study of BJT-based building blocks for the implementation of modified state- 
equations based on lossy-integrators is underway. 

The behaviour of these structures with power supplies lower than 3V is also being 
studied for very low frequency filters. 
Abstract 

The layout synthesis of analog circuits remains today a complex task difficult to 
automate. Various strategies are used, from a completely full-custom design up to 
an automated synthesis taking into account analog constraints. One popular 
technique developed to help the analog layout synthesis is the use of procedural 
design of basic cells. These cells are afterwards assembled either manually or using 
other CAD tools. The device generators described in this paper are developed to 
enhance layout design productivity of analog cells. These generators are written in 
the Skill language for the Opus-Cadence framework. They cover the most common 
devices: resistors, capacitors, inductors, various transistor shapes, differential pairs 
and current sources. The context and the objectives of this work, the development 
strategy and the main functionalities are presented. Finally, an example is given. 
These generators have been successfully used to design several circuits. 
CAD, Analog Layout Generation, Device Generators 

1 INTRODUCTION 

During the last few years, synthesis tools for analog or mixed-signal integrated 
circuits (ICs) have improved significantly, even if the commercial offering 
remains limited (Carley, 1996). Progress is real at 
all levels: from layout cell synthesis up to macroblock synthesis including 
architecture or topology choice and circuit sizing. Several factors 
reinforce this tendency: 

• More and more, complete system integration on a chip is becoming 
possible and necessary. This requires implementing analog and 
digital parts, and sometimes also sensors or other devices, on the same circuit. 
The very competitive microelectronics market creates the need for 
methods and tools to design all parts of the system. A lack of tools for one part 
means much more human effort and a significant extra cost for the final product. 

• A large variety of mature tools is available for digital circuit design. Even if 
they are not always seen as ideal solutions from the designer's point of view, 
they give a huge productivity gain which remains without counterpart in the 
analog domain where often a circuit design is still based on full custom 
methods using a very small set of CAD tools, mainly simulation and some 
analysis tools. This relatively mature state of tools for digital circuits has two 
consequences: Some previously digital CAD designers are now working to 
develop tools for analog or mixed-signal circuits. Some methods and 
algorithms earlier used to solve digital problems may often be adapted and 
partially re-used for other applications. 

• Even if the number of basic components is much smaller in analog 
circuits than in digital ones, each component has to be considered with a 
much more precise description and model. Also, the number of 
parameters which may influence the performance or, more generally, the 
implemented function itself is large. Many degrees of freedom have to be 
managed. The first task is to acquire a detailed knowledge of the structures 
involved. Then, it becomes possible to develop methods and finally tools to 
help the design. The low cost computation power today available simplifies the 
development and use of specific algorithms to solve complex problems of 
analog design in a reasonable time. 

One of the topics that has seen considerable effort and improvement over the past 
couple of years is analog layout synthesis. In this area, two different problems 
are addressed: the cell layout synthesis, which consists of the creation of masks 
from a transistor-level schematic under a set of constraints and the assembly of 
cells or macrocells, that is, the placement and routing of the whole analog circuit. 
This paper is focused on the study of analog cell synthesis. 

The objective of analog cell layout synthesis is ideally to obtain masks 
that fully satisfy the constraints and the desired performances. The mask 
layout should be as good as a full custom design realised by an expert designer, 
but with a much shorter development time. To reach this objective, 
various strategies have been proposed. In the following section, the main 
approaches and their characteristics are presented. 

Then, the analog cell layout generator approach is discussed in more detail. A 
set of efficient cell layout generators has been developed. The layout strategy and 
the context of the development are presented. An example of a waffle shaped 
transistor generator is presented. This set of generators has been implemented in 
the Cadence Opus framework, using the parameter cells facilities. 



2 ANALOG LAYOUT STRATEGIES 

From the earliest attempts up to the most recent ones, the approaches have evolved 
from nearly digital strategies to analog-performance-oriented strategies. The first 
approaches developed were based on some analog device generators and place- 
and-route tools derived from the existing algorithms for digital circuits 
(Rijmenants, 1989). Another strategy was to develop a generator program for each 
different basic cell. This approach is interesting when only few changes 
(orientation, sizing, etc.) have to be realised in the layout. In such a case, the 
generation is fast and satisfactory (Kuhn, 1987). These tools are based on a rich 
device generator library. They require generally a considerable designer effort to 
perform the assembly of the generated structures in order to respect the analog 
constraints which are not directly included in the synthesis process. A strong 
interactivity is then necessary. 

More recently, various tools have been developed in order to include analog 
characteristics into the synthesis of the layout (Cohn, 1994; Lampaert, 1995; 
Miliozzi, 1996). These systems are using also generators but only for very basic 
structures. With powerful algorithms, they construct more complex structures 
during the layout synthesis process, particularly by merging components. The 
objective is to keep all possibilities avoiding a priori choices without maintaining a 
rich set of complex generators. However, a limitation is that some optimised 
configurations cannot be obtained. In some critical applications, the use of these 
optimised device layouts is important in order to reach the desired performances. 
The goal of these tools is to complete in a fully automated way the optimised 
synthesis of an analog circuit from the definition at a higher level of all constraints 
and specifications of the design. One consequence is that interactivity should not 
be necessary. In fact, many algorithms used (simulated annealing for example) do 
not allow human interaction. 

All strategies are therefore based on the use of generators, simple or complex 
depending mostly on the algorithms used for assembly. Various works recently 
described propose the use of sets of generators of medium complexity: optimised 
shaped transistors, basic active devices or passive components (Bruce, 1996; 
Owen, 1995). The work presented in this paper belongs to this strategy. 



3 PARAMETRIZED CELLS UNDER THE CADENCE-OPUS 
FRAMEWORK 

The set of generators has been developed with the facilities (Virtuoso Layout 
Editor - Parameterized cells) proposed in the Cadence Opus framework. The 
generators may be defined graphically or by writing the corresponding code in the
Skill language (Cadence). In this case, it was preferred to write directly in the Skill
language because of the high level of parameterization necessary: none of the
parameter values is hardcoded. The main transformations available for
parameterized cells (pcells) are: stretch of polygons, repetition, inclusion of
parameterized labels, inclusion of parameterized levels of masks, conditional
inclusion of shapes, repetition along shapes, inheritance of parameters and
parameterization of properties. The pcells may be instantiated like any other cell, but
their parameters may be specified at each instantiation.



4 DEVELOPMENT STRATEGY 
4.1 Technology definition 

One of the most important objectives is the independence of the tool with respect to
technology evolution. In practice, a change of a technology rule must not
affect the code of the generators and must not require individual fitting or
rewriting of each generator program. That is why the geometric rules and some
electrical rules were defined in a fully independent way: it was preferred to define our
own technological description without interference with the Opus technology
file. This choice comes from two essential considerations:

• with its own rule definition, the generator program is immune to
some technology design-kit alterations and to the requirements of other tools;

• moreover, a possible future recoding in a more common language will be
simpler if there is no strong link between the Opus structures and our system.

However, for major technology changes, the generator structures have 
sometimes to be changed. 

The technology data are stored in a structure composed of basic parameter
values and lists. The basic parameters are the minimum pitch and the name of the
technology. The lists group the layers (group Layer) or geometrical or electrical
rules of the same type. The groups of geometrical rules are Width, Space, Overlap
and Intersection. The groups of electrical rules are Resistor and Capacitor. A
group Other is used to describe particular rules which do not correspond to the
previous groups.

A rule is a set of keywords with an associated value. The first keyword is the
name of the group. The following keywords are usually symbolic names of layers.
That is, one or more symbolic names are defined for each mask layer. For
example, the following rules may be defined:

'(Layer poly1 "poly")

'(Layer gate "poly")

These rules define two symbolic names of layers: poly1 and gate. The definition
of the symbolic names of layers has to be done first in order to use them for other
kinds of rules. For instance, the overlap of polysilicon over a contact may be
defined as follows:

'(Overlap poly1 contact 0.30)

Generally, only one or two symbolic layer names are necessary to characterise a 
rule. Sometimes, however, a greater number is required. The use of non symbolic 
names as keywords is also possible. This is particularly useful for the rules 
included in group Other. For example: 

'(Other contactOnCapacitor t) 

This rule specifies that the implementation of contacts on the top plate of a 
capacitor is allowed. 

The formalism used to retrieve a value of a defined rule is similar to the one 
described above. 
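
As an illustration only, the rule formalism can be mimicked outside Skill. The following Python sketch stores each rule as a group name followed by keywords and retrieves a value with the same keyword sequence. The class `TechRules` and its methods `add` and `get` are hypothetical names introduced here; they are not part of the tool described in this paper.

```python
class TechRules:
    """Toy rule store (hypothetical): a rule is (group, keyword..., value)."""

    def __init__(self):
        self._rules = {}

    def add(self, group, *keywords_and_value):
        # The last element is the value; everything before it forms the key.
        *keywords, value = keywords_and_value
        self._rules[(group, *keywords)] = value

    def get(self, group, *keywords, default=None):
        # Retrieval uses the same formalism as definition: group + keywords.
        return self._rules.get((group, *keywords), default)


rules = TechRules()
rules.add("Layer", "poly1", "poly")             # symbolic name -> mask layer
rules.add("Layer", "gate", "poly")
rules.add("Overlap", "poly1", "contact", 0.30)  # overlap of poly1 over contact
rules.add("Other", "contactOnCapacitor", True)  # non-symbolic keyword

overlap = rules.get("Overlap", "poly1", "contact")
```

Keying the store on the full keyword tuple keeps definition and retrieval symmetric, which is the property the text relies on.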

4.2 Generator description 

A generator is a program which builds shapes, generally rectangles, from a set of 
user defined variables and a set of technology dependent predefined parameters. 
For example, for a MOS transistor generator, the channel width and length are user 
variables and the minimum width or the minimum length are technology 
dependent parameters. 

Each generator is described in a separate file. They all follow the
same organisation:

• Declaration of the variables: Name and default value. 

• Definition of local variables from the values of necessary technological rules 
(layers, geometrical and electrical rules). 

• Computation of some local variables necessary to generate the shapes from the 
variables and the previous local variables. 

• Generation of the shapes. 
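
The four steps above can be sketched as follows. This is a deliberately simplified Python illustration, not the authors' Skill code: the `Rect` type, the function `mos_generator`, the rule keys and all numerical values are assumptions chosen only to show the organisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    layer: str
    x0: float
    y0: float
    x1: float
    y1: float


def mos_generator(rules, width=None, length=None):
    """Build the rectangles of a single-gate MOS transistor (simplified)."""
    # 1. User variables: default to the technology minimum when omitted.
    min_width = rules["Width", "diffusion"]
    min_length = rules["Width", "gate"]
    width = min_width if width is None else max(width, min_width)
    length = min_length if length is None else max(length, min_length)
    # 2. Local variables read from the technology rules.
    gate_ext = rules["Overlap", "gate", "diffusion"]   # gate past diffusion
    sd_ext = rules["Overlap", "diffusion", "gate"]     # source/drain past gate
    # 3. Derived geometry.
    diff_len = length + 2 * sd_ext
    # 4. Shape generation.
    return [
        Rect("diffusion", 0.0, 0.0, diff_len, width),
        Rect("gate", sd_ext, -gate_ext, sd_ext + length, width + gate_ext),
    ]


# Hypothetical rule values, in micrometres.
rules = {
    ("Width", "diffusion"): 1.0,
    ("Width", "gate"): 0.5,
    ("Overlap", "gate", "diffusion"): 0.25,
    ("Overlap", "diffusion", "gate"): 0.75,
}
shapes = mos_generator(rules, width=4.0, length=1.0)
```

No dimension appears as a literal inside the generator body; every number comes either from the caller or from the rule set.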

In some particular cases, a more complex structure has been used. For instance, a 
square capacitor array generator uses a set of small generators which produce 
different parts of the global structure. A program is therefore necessary to control
the assembly of the parts of the array. For this purpose, the homogeneous array facilities
provided by Opus are used.

4.3 Generator parameters

The code of the generators is entirely defined using variables or parameters. There
is no direct reference to a hard-coded numerical value. When a generator needs a
technological parameter, a set of rules associated with the generator is
examined first. All rules may be redefined for a particular generator. If the rule is not found there, it
is searched for in the technology rule set. Finally, if it is still not found, a default value is
assumed.
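
A minimal sketch of this three-level search, written in Python for illustration. The function name `lookup`, the dictionary layout and the numerical values are assumptions, not the actual implementation.

```python
def lookup(key, generator_rules, technology_rules, default):
    """Return the first match: generator set, then technology set, then default."""
    if key in generator_rules:
        return generator_rules[key]
    if key in technology_rules:
        return technology_rules[key]
    return default


# Hypothetical values: the capacitor generator redefines one spacing rule.
technology_rules = {("Width", "poly1"): 0.75, ("Space", "poly1"): 1.0}
capacitor_rules = {("Space", "poly1"): 1.5}

spacing = lookup(("Space", "poly1"), capacitor_rules, technology_rules, 0.0)
width = lookup(("Width", "poly1"), capacitor_rules, technology_rules, 0.0)
missing = lookup(("Width", "metal3"), capacitor_rules, technology_rules, 2.0)
```

Here the generator-specific spacing shadows the technology value, the width falls through to the technology set, and the unknown rule falls back to the supplied default.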

Some data are searched for only in the generator parameter set, for instance, the
layers used. Sometimes, other parameters are defined to specify a fixed
characteristic which may differ from one technology to another or from one
desired configuration to another. For instance, a generator of square capacitor
arrays has been defined, with the option to allow or forbid the direct
implementation of contacts on top of the upper plates.



5 GENERATORS OVERVIEW 

The development of a generator library is a cumbersome task: the variety of
generators must be large enough to cover the main needs of analog circuit
design. Also, the generator set should offer some elaborate structures, to give a
higher gain with respect to a fully manual design. However, the development of a
complex generator may involve a loss of generality. That is why many
automatic analog synthesis tools use a very small set of device generators
(Cohn, 1994). They consider that more complex configurations may be reached
through the optimisation algorithms. Even so, some structures are really difficult to
obtain automatically. That is why a set of generators should contain the basic
structures and some more elaborate ones. The latter are particularly important
to optimise some electrical performances or to reduce area.

Today, the set of generators includes: 

• folded transistors; 

• differential pairs; 

• current sources; 

• waffle shaped transistors; 

• capacitor arrays; 

• inductors; 

• resistors; 

• contacts. 



6 WAFFLE SHAPED TRANSISTOR 

The possibly interesting shapes of a transistor are numerous. The waffle shaped
configuration is well suited to implement very wide transistors (large channel
width over length ratio). The structure offers a high channel width over length
(W/L) ratio per unit area while keeping low source, drain and gate resistances.
Furthermore, the reduced areas of drain and source allow low junction capacitances
on these terminals. In particular, the influence of the side capacitance is low. A typical
waffle-shaped transistor is shown in Figure 1: the gate is
subdivided into segments distributed horizontally and vertically in order to form a
grid. The source and drain connection points are located between the gate segments.




Figure 1 Waffle shaped transistor 

A drawback of this structure is that the connections between drains and the
connections between sources necessarily cross the gate segments,
which increases the parasitic capacitance between the source and the gate
and between the drain and the gate. This parasitic capacitance may cause problems
for some applications. Another limitation of this structure compared with simpler
ones is that it is not possible to merge transistors or to realise interdigitated transistors. It
may also be noticed that this structure is not well characterised: because of the
crossing of gate segments, the total channel width of the transistor is difficult to
evaluate. As long as no accurate model exists to compute the width
of the transistor, the use of this structure will be restricted to applications requiring
only low precision on the characteristics. The structure may also be used in digital
circuits to implement buffers.

This section describes how to design the component and the main
characteristics of the developed generator.

The following notations are used (Figure 2): 

• W    total channel width;

• L    channel length;

• hg    number of horizontal gate segments;

• vg    number of vertical gate segments;

• dg    distance between two parallel gates;

• hw    horizontal external width of source or drain;

• vw    vertical external width of source or drain;

• gc    factor to take into account the crossing of gate segments (0 < gc < 1);

• dd    coefficient to compute distances with diagonal connections (dd > √2).







Figure 2 Notations 



6.1 Distance between gate segments 



It is interesting to minimise the distance between gate segments in order to 
maximise the gate density and to minimise the source and drain areas. Hence, the 
parameter dg is computed from the geometric rules of the technology. In most 
cases, the diagonal constraint on metal 1 determines its value. 



6.2 Total width of the transistor 



It is assumed in this section that the number of horizontal and vertical gate 
segments are parameters. The expression of the total width as a function of these 
parameters is deduced. 

For the core of the transistor (without the external channels):

$W_{core} = (vg - 1) \times hg \times (dg + gc \times L) + (hg - 1) \times vg \times (dg + gc \times L)$

For the external channels:

$W_{extern} = vg \times (2 \times vw + gc \times L) + hg \times (2 \times hw + gc \times L)$

$W_{total} = W_{core} + W_{extern}$



6.3 Horizontal and vertical numbers of gate segments

The external channel widths hw and vw admit a lower bound $we_{min}$, which is
determined by the area necessary to implement a contact on an external source or
drain.

In order to determine the horizontal and vertical numbers of gate segments, the
previous equation is used with hw and vw replaced by $we_{min}$. A set of couples
(hg, vg) is determined such that:

$W(hg, vg, we_{min}, we_{min}) \le W$,

$W(hg + 1, vg, we_{min}, we_{min}) > W$,

$W(hg, vg + 1, we_{min}, we_{min}) > W$.

From this set, a specified form factor may be used to choose the gate segment
numbers. Then, hw and vw are adjusted to obtain the desired total width W.
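
The selection of the couples (hg, vg) can be sketched in Python as below. The sketch assumes that the total width is the sum of the core and external contributions of section 6.2, and that a couple is retained when its minimum-size width does not exceed the target W while one additional segment in either direction does. The function names, the search bound `max_segments` and the numerical values are hypothetical.

```python
def total_width(hg, vg, hw, vw, L, dg, gc):
    """Total channel width: core channels plus external channels."""
    core = (vg - 1) * hg * (dg + gc * L) + (hg - 1) * vg * (dg + gc * L)
    extern = vg * (2 * vw + gc * L) + hg * (2 * hw + gc * L)
    return core + extern


def candidate_segments(W, L, dg, gc, we_min, max_segments=64):
    """Couples (hg, vg) that fit W at minimum external width, while one more
    horizontal or vertical segment would exceed W."""
    def w(hg, vg):
        return total_width(hg, vg, we_min, we_min, L, dg, gc)

    return [
        (hg, vg)
        for hg in range(1, max_segments + 1)
        for vg in range(1, max_segments + 1)
        if w(hg, vg) <= W < min(w(hg + 1, vg), w(hg, vg + 1))
    ]


# Hypothetical values: L = 1, dg = 2, gc = 0.5, we_min = 1, target W = 20.
couples = candidate_segments(W=20, L=1, dg=2, gc=0.5, we_min=1, max_segments=8)
```

A form factor would then pick one couple from `couples`, after which hw and vw are enlarged from the lower bound until the target width is met. That last adjustment is not shown here.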



6.4 Waffle shaped transistor generator 



Currently, the implemented waffle shaped transistor generator is defined from the 
following parameters: 

• length    length of the channel;

• horFoldFactor (hg)    number of horizontal gate segments;

• verFoldFactor (vg)    number of vertical gate segments;

• lateralHorWidth (hw)    horizontal external width of source or drain;

• lateralVerWidth (vw)    vertical external width of source or drain.

From these parameters, the total width of the transistor channel is computed. 

Also, the total area, perimeter of source and drain and parasitic capacitors are 
calculated. 



7 CIRCUITS DESIGNED WITH THE GENERATORS 

Various circuits have been designed using the set of generators described in this
paper (Porte, 1994; Oliaei, 1996). Some of them are briefly described below.

• A circuit was designed to implement an Operational Transconductance 
Amplifier with differential structure. The layout design was easily performed 
based on a symmetrical axis. Generators were used for differential pair, but 
also for other structures like current mirrors, output stage and common mode 
feedback. 

• A p-n junction-based precision temperature transducer circuit was integrated. 
All analog parts of the circuit were designed using the generators (resistors, 
operational amplifiers, switches, capacitor arrays). This circuit is currently in 
fabrication (Freire, 1996).

• A switched-current sigma-delta modulator has been implemented. An accurate
relationship between different currents is essential in this type of circuit.
Generators were used to realise track-and-hold circuits, integrators and current
comparators. The circuit is based on a switched-current class AB sigma-delta
modulator (Oliaei, 1997a).

• A current multiplier circuit has been designed. The circuit layout is based on a 
symmetrical axis. Cascoded current mirrors are particularly used. This is a 
CMOS class AB current multiplier (Oliaei, 1997b). 

• Recently, a circuit was designed to test the realisation of inductors in CMOS 
technology. Generators were used to realise different inductors to test the 
feasibility for radiocommunication applications. 

In the future, an advanced version of the generator set will be associated with a sigma-delta
switched-current structure synthesis tool currently under study.



8 FUTURE WORKS 

The principal objective of development is to constitute a complete system for 
analog cell synthesis from specifications to the final layout. 

Currently, the generators are being rewritten in C language. The following 
improvements are incorporated: 

• two representation levels are used: a layout level and an abstract level. The 
abstract level contains a subset of geometrical and electrical data useful for the 
placement program. 

• when it is not already implemented, parasitics are calculated. 

The inputs of the analog synthesis tool will be the technology data, user
specifications and a netlist generated with COMDIAC, a tool developed at the
Ecole Nationale Superieure des Telecommunications. From a set of user
specifications and technology data, COMDIAC produces a sized netlist with some
other results: operating-point voltages and currents, obtained specifications and
limits.

From the results produced by COMDIAC, the interesting shapes at the abstract 
level will be generated for each device of the netlist. 

At the same time, a first selection of potentially interesting topologies will be
performed. This first placement will be based only on symmetries, weighted
matching properties of elements and possibly some user constraints. At this
phase, the user may interact to eliminate or impose topologies.

Then, each selected topology will be analysed using the abstract level 
representation of devices, an evaluation of routing and the electrical characteristics. 
The various orientations of the devices will be tested. As in (Lampaert, 1995), the 
sensitivities will be used in order to evaluate the degradation of performances 
induced by a topology. A subset of shapes with the best results will be kept.

Finally, each selected shape will be routed and analysed again in order to select
the best one.

In the strategy described above, algorithms which include random choices, such as
simulated annealing or some genetic algorithms, will be avoided where possible.



9 CONCLUSION 

A set of layout generators for basic devices has been presented. Written in Skill for 
the Cadence-Opus framework, they are easily adaptable for compatible CMOS 
technologies by redefining a parameter file. They have been used successfully to 
design the layout of various analog integrated circuits in a double-polysilicon, 
double-metal CMOS technology. The use of the generators reduces the time
required to lay out a circuit and allows various configurations to be tried rapidly
in order to obtain the best performance. Taking advantage of the formalism
defined, new generators may be created easily. Rewritten in the C language, these
generators will be a base for an analog cell synthesis tool and for a sigma-delta
switched-current structure synthesis tool.



Abstract

The Extended True Single-Phase-Clock (E-TSPC), an extension of the TSPC CMOS 
circuit technique, is proposed and analysed. This technique consists of a set of 
composition rules to build CMOS single-phase circuits. The composition rules are 
provided to avoid race problems and to preserve data during the holding phases. The
CMOS blocks used are conventional static CMOS logic, n/p dynamic logic, n/p
latches, data precharged blocks, and the new N-MOS like blocks. Design results show that
E-TSPC can achieve 70% speed improvements, compared with conventional
TSPC techniques, and large power and area savings. A complete dual-modulus
prescaler (divide-by-128/129) was implemented in a 0.8 µm CMOS process, and a
1.61 GHz rate was achieved with a 14.2 mW power consumption.

Keywords 

VLSI design, high speed digital CMOS, clock strategy. 



1 INTRODUCTION 

Several CMOS clocking policies have been proposed to design VLSI systems. The 
pseudo-two-phase logic was one of the earliest techniques (West, 1985). Later on 
two-phase logic structures were proposed, like the Domino technique (Krambeck, 
1982) and the NORA technique (Goncalves, 1983 and 1984). A single-phase clock 
policy was first introduced in (Yuan, 1987) (the True Single-Phase-Clock (TSPC)) 
and subsequently advanced by (Afghahi, 1990). 

In this work we introduce and analyse the new Extended True Single-Phase-Clock 
CMOS circuit technique (E-TSPC). E-TSPC is an extension of the TSPC technique 
and consists of a full set of composition rules to build CMOS single-phase circuits, 
which uses static, dynamic, latch, data precharged, and N-MOS like blocks. The 
new N-MOS like blocks enhance the capability of the technique, since they allow
building p-MOS chains with speed comparable to n-MOS. The result is a powerful 
technique which permits the design of complex and fast circuits. 

The next section presents the main design rules of the E-TSPC technique, the proofs
of their correctness, and an exception rule. The new N-MOS like blocks are
introduced in section 3. Some design examples are considered in section 4, and the
main conclusions are drawn in section 5.



2 COMPOSITION RULES FOR THE E-TSPC CMOS TECHNIQUE 

We will present in this section the composition rules for E-TSPC. They were built in 
a way similar to that of NORA (Goncalves, 1984). The NORA technique is based on 
two-phase logic structures, and it uses static, n/p-dynamic, and C²MOS logic blocks.
The composition rules in NORA were provided to avoid race problems. 

2.1 Basic CMOS blocks 

An E-TSPC circuit can use static CMOS logic, n/p-dynamic blocks, n/p-latches, and 
high (PH)/low (PL) data precharged blocks (Dp blocks) (Figures 1 and 2). 




d) n-latch block    e) p-latch block



Figure 1 Construction blocks of the E-TSPC technique. 

Dp blocks require a more detailed explanation (Yuan, 1993). In Figure 2 we show
how PH and PL blocks can be formed, using as an example the logic function
d=a(b+c). The circuits are structured starting from the static block. When a PH
block is desired, for instance, we modify the n-transistor logic by handling the
parallel branches. For each group of k parallel branches, we cut (k-1), leaving only
one branch. This process is depicted in Figure 2, with the two resulting PH blocks shown in
Figure 2b. Similarly, a PL block can be obtained if the p-transistor branches are
modified. The final speed of the Dp block depends on which transistors are cut.
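
The branch-cutting step can be illustrated with a short Python sketch. It assumes that the n-transistor network of the example is transistor a in series with the parallel pair (b, c), and it encodes a network as nested tuples. This encoding and the function names are hypothetical and serve only to enumerate the possible PH blocks.

```python
from itertools import product


def cut_variants(net):
    """Yield every network obtained by keeping one branch per parallel group."""
    if isinstance(net, str):          # a single transistor, named by its input
        yield net
        return
    kind, branches = net
    if kind == "p":                   # keep exactly one of the k parallel branches
        for branch in branches:
            yield from cut_variants(branch)
    else:                             # series: combine the variants of each element
        for combo in product(*(list(cut_variants(b)) for b in branches)):
            yield ("s", list(combo))


def inputs(net):
    """Set of input names driving the transistors of a network."""
    if isinstance(net, str):
        return {net}
    return set().union(*(inputs(b) for b in net[1]))


# Assumed n-network of the example: a in series with (b parallel c).
n_network = ("s", ["a", ("p", ["b", "c"])])
ph_blocks = list(cut_variants(n_network))
pc_inputs = [sorted(inputs(v)) for v in ph_blocks]
not_pc_inputs = [sorted(inputs(n_network) - inputs(v)) for v in ph_blocks]
```

For this network the sketch yields two variants, one keeping the b branch and one keeping the c branch. The inputs that remain in the n-network are the pc-inputs of the resulting PH block.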

The PH block inputs which are connected to the block n-transistors and the PL 
block inputs, connected to the p-transistors, are called precharged inputs (pc-inputs). 
The PH blocks whose pc-inputs are high, when the clock is low, and the PL blocks
whose pc-inputs are low, when the clock is low, are called n-Dp blocks. Similarly, a
p-Dp block denotes a PH block with pc-inputs at high or a PL block with pc-inputs
at low whenever the clock is high.

E-TSPC: extended true single-phase-clock CMOS circuit technique





[Figure 2 schematics. Panels: a) static CMOS; b) PH blocks; c) PL blocks. Annotations recovered from the figure: "pc-inputs: a, b; not pc-inputs: c"; "pc-inputs: a, c; not pc-inputs: b"; "pc-inputs: a, c; not pc-inputs: b".]

Figure 2 Transformation from a static block (a) into Dp blocks: (b) PH blocks; (c)
PL blocks.



2.2 Composition rules 

We now present the composition rules. First, the concept of a data chain is defined.
Definition 1: An n-data chain is any acyclic signal propagation path:

1. containing at least one n-latch, or one n-dynamic, or one n-Dp block; 

2. starting in a circuit external input, or in the output of a p-latch, or p-dynamic, or 
p-Dp block; when this output is followed by static blocks in the normal data 
flow, the data chain starts in the output of the last static block; 

3. going through static, n-dynamic, n-Dp, or n-latch blocks; 

4. regardless of the number and ordering of the blocks defined above; 

5. finishing in a circuit external output, or in the input of the first p-latch, or p- 
dynamic, or p-Dp block. 

For the p-data chain, an equivalent definition applies, exchanging n and p
everywhere in the definition (we will use x-data chain, where x is n or p).

Except for clock skew race, TSPC systems are subject to race errors and 
connection limitations, which are equivalent to NORA two-phase race and 
connection difficulties. Additionally, new connection rules should be incorporated 
to warrant the correct operation of the n/p-latch (Yuan, 1987 and Afghahi, 1990) 
and the PL/PH blocks (Yuan, 1993). 

Five composition rules are proposed below for the E-TSPC technique. Any data 
chain should observe all rules for correct operation. 

Composition rule 1 (r_1): Consider any x-data chain. The x-data chain input should
be: an input of a dynamic block, an input of a latch, or a not pc-input of a Dp block.

Composition rule 2 (r_2): Consider any x-data chain. An x-latch must not drive,
directly or through static blocks, an x-dynamic or an x-Dp block.

Composition rule 3 (r_3): Consider any x-data chain. The number of inversions
between:

r_{3a}. any two adjacent x-dynamic blocks must be odd;

r_{3b}. any two adjacent Dp-blocks of the same type must be odd;

r_{3c}. any two adjacent Dp-blocks of complementary types must be even;

r_{3d}. a PH (PL) block adjacent to an n (p)-dynamic (or vice versa) in an n (p)-data chain
must be even;

r_{3e}. a PL (PH) block adjacent to an n (p)-dynamic (or vice versa) in an n (p)-data chain
must be odd.

Two blocks are called adjacent if there are only static blocks between them.

Composition rule 4 (r_4): Consider any x-data chain and the last dynamic block in
this data chain (when it exists). The number of signal inversions, from this dynamic
block up to at least one x-latch, must be even.

Composition rule 5 (r_5): Consider any x-data chain. It must have one of the two
configurations:

r_{5a}. at least one dynamic block and one latch;

r_{5b}. at least two latches and an even number of inversions between them.
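
As an illustration of composition rule 3, the parity requirements can be written as a small Python check. The sketch assumes that a data chain is described by the ordered list of its dynamic and Dp blocks, with only static blocks between consecutive entries, and by the number of inversions between each consecutive pair. The function names and this encoding are hypothetical.

```python
def required_odd(a, b, chain="n"):
    """Return True if the inversion count between adjacent blocks a and b
    must be odd; a and b are "dyn", "PH" or "PL"; chain is "n" or "p"."""
    pair = {a, b}
    if pair == {"dyn"}:
        return True                    # two dynamic blocks: odd
    if "dyn" not in pair:
        return len(pair) == 1          # same Dp type: odd; complementary: even
    dp = (pair - {"dyn"}).pop()
    even_type = "PH" if chain == "n" else "PL"
    return dp != even_type             # PH (PL) next to n (p)-dynamic: even


def check_rule3(blocks, inversions, chain="n"):
    """blocks: dynamic/Dp blocks in order; inversions[i]: number of inversions
    between blocks[i] and blocks[i + 1]."""
    return all(
        (inv % 2 == 1) == required_odd(a, b, chain)
        for a, b, inv in zip(blocks, blocks[1:], inversions)
    )


ok = check_rule3(["dyn", "PL", "PH"], [1, 0])   # odd, then even
bad = check_rule3(["dyn", "dyn"], [2])          # even count between dynamics
```

Latches are not handled by this sketch, since rule 3 only constrains dynamic and Dp blocks; rules 2, 4 and 5 would need separate checks.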

2.3 Analysis of r_1-r_5 correctness

We will present six theorems which show that when the composition rules r_1-r_5 are
obeyed, no problems due to discharge, pre-charge, and data holding occur. The
proofs of the theorems demand two additional hypotheses:

Hypothesis 1 (hyp_1): The clock phases of the system are long enough to permit all
data chain input signals and outputs of dynamic blocks in holding to propagate up to
the data chain outputs.

Hypothesis 2 (hyp_2): The circuit external inputs should be stable during the
evaluation phase of the data chains. Therefore, the external inputs of n-data chains
should be stable when clock = "1" (φ = "1"); the external inputs of p-data chains
should be stable when clock = "0" (φ = "0").

Additionally, a definition of block order is necessary in the theorem proofs. 
Definition 2: Consider a Dp or a dynamic block B_0, and the set F = {all x-data
chains which go through B_0, where if B_0 is an n-block, x is n, and if B_0 is a p-block, x
is p}. We call the number O(B_0) = maximum{(number of x-Dp blocks which
precede the B_0 block in the data chain D_c), ∀ D_c ∈ F} the B_0 order.
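
The block order can be computed directly from this definition. The Python sketch below assumes that each data chain is given as a list of block names in propagation order and that the set of Dp blocks is known. The function name `block_order` and the example names are hypothetical.

```python
def block_order(block, chains, dp_blocks):
    """Maximum, over the chains that go through `block`, of the number of
    Dp blocks that precede it in the chain."""
    counts = [
        sum(1 for b in chain[:chain.index(block)] if b in dp_blocks)
        for chain in chains
        if block in chain
    ]
    return max(counts, default=0)


# Hypothetical chains: D* are dynamic blocks, P* are Dp blocks, B is the block studied.
chains = [["D1", "P1", "P2", "B"], ["D2", "B"], ["D3", "P3"]]
dp_blocks = {"P1", "P2", "P3"}
order_of_b = block_order("B", chains, dp_blocks)
```

Zero order blocks are those with no Dp block upstream in any chain, which is the base case of the inductions that follow.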

The theorems and their proofs will follow. 

Theorem 1 (2): If the composition rules are obeyed by all data chains, then the
output of an n-Dp (p-Dp) block will reach the
"1" value, for a PL block, or the "0" value, for a PH block,
after some time in the holding phase, and it will keep this value while φ = "0" (φ = "1").

Proof of theorem 1 (the proof of theorem 2 is analogous): To prove this theorem, we will
proceed by mathematical induction with respect to the block order.
First, we should establish that the theorem is true for any n-Dp block B_0 with zero
order. There are two possibilities (r_1 and r_2):

i. B_0 is a PL block; in this case, all data chains going through the B_0 pc-inputs are
connected to n-dynamic blocks, and there is an odd number of inversions
between B_0 and the dynamic block (r_{3e});

ii. B_0 is a PH block; in this case, all data chains going through the B_0 pc-inputs are
connected to n-dynamic blocks, and there is an even number of inversions
between B_0 and the dynamic block (r_{3d}).

When φ = "0", the n-dynamic outputs are "1", and consequently all pc-inputs are
"0" and "1" for PL and PH blocks, respectively. The pc-inputs will force the PL
output to "1" and the PH output to "0"; thus, the theorem is true for zero order
blocks.

It remains to show that if the theorem is true for any n-Dp block of order N or
less, it is also true for an n-Dp block B_{N+1} with order equal to N+1.

First, let us take B_{N+1} as a PL block. Any data chain going through a pc-input of
B_{N+1} has one of the following blocks:

i. an n-dynamic block, and there is an odd number of static inversions between B_{N+1}
and this block (r_{3e});

ii. a PL block, and there is an odd number of static inversions between B_{N+1} and the
PL block (r_{3b});

iii. a PH block, and there is an even number of static inversions between B_{N+1} and the
PH block (r_{3c}).

In case (i), the dynamic block will impose "0" on the B_{N+1} pc-inputs when φ = "0";
this is similar to zero order blocks. On the other hand, the PL and PH blocks
of cases (ii) and (iii) have order N or less. In consequence, their
outputs are all at "1", for PL blocks, or "0", for PH blocks, when φ = "0". These
signals will arrive at the B_{N+1} inputs, after passing through the inversions, with value
"0". Hence, all B_{N+1} pc-inputs will be at "0", and its output will be at "1".

A similar reasoning can be applied if B_{N+1} is a PH block, so the theorem is true for
any block with N+1 order. Hence, by induction, the theorem is true for any n-Dp
block. ■
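
The parity argument used in this proof can be checked exhaustively for an n-data chain with a few lines of Python. The sketch assumes the holding-phase output levels stated in the theorem (n-dynamic and PL outputs at "1", PH outputs at "0") and the inversion parities of composition rule 3. The table and function names are hypothetical.

```python
# Output level of each driver type during the holding phase (phi = "0").
HOLD_OUTPUT = {"dyn": 1, "PL": 1, "PH": 0}

# (driver, driven Dp block) -> True if rule 3 requires an odd inversion count.
ODD = {
    ("dyn", "PL"): True, ("dyn", "PH"): False,
    ("PL", "PL"): True, ("PH", "PH"): True,
    ("PL", "PH"): False, ("PH", "PL"): False,
}


def pc_input_in_holding(driver, driven):
    """Level seen at a pc-input of `driven` when fed by `driver`."""
    inversions = 1 if ODD[(driver, driven)] else 0   # only the parity matters
    return HOLD_OUTPUT[driver] ^ inversions


def induction_step_holds():
    """Every allowed driver puts PL pc-inputs at 0 and PH pc-inputs at 1."""
    wanted = {"PL": 0, "PH": 1}
    return all(pc_input_in_holding(drv, blk) == wanted[blk] for drv, blk in ODD)


consistent = induction_step_holds()
```

With pc-inputs at "0" the p-transistors of a PL block pull its output to "1", and with pc-inputs at "1" the n-transistors of a PH block pull its output to "0", which is the level the theorem states.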

Theorem 3 (4): Consider an n-data (p-data) chain, and let h be the output of the
last latch of the data chain. If the composition rules are obeyed by all data chains, then h is
stable during the whole holding phase (φ = "0" (φ = "1")).

Proof of theorem 3 (the proof of theorem 4 is analogous): Consider all data chains D_{c1},
D_{c2}, ... which go through h. They are all n-data chains. The h value will be
modified only if (we use hyp_1):

a. the input of any n-data chain was modified, and the signal propagates to h;

b. the pre-charge of a dynamic block of any data chain propagates to h.

We will show that none of these situations can occur. Consider the data chain
D_{ci}. Two cases are possible:

i. the data chain has two n-latches, and there is an even number of inversions
between them (r_{5b});

ii. the data chain has at least one dynamic block, and there is an even number of
inversions between the last dynamic block and one n-latch (r_4, r_{5a}).

First, consider case (i); B_1 and B_2 designate, respectively, the first and the
second latch of D_{ci}. After B_1 no dynamic block is allowed (r_2). Thus the data chain
D_{ci} can modify h only when a signal propagates from a B_1 input up to h. Suppose that a
B_1 input has a "0"→"1" transition. This transition cannot modify the block output
and, consequently, the h value. Alternatively, if, during the φ = "0" phase, any B_1
input changes from "1" to "0", the transition may act on the B_1 output, putting it at
"1", and propagate to the B_2 input; as a result, a "0"→"1" transition is imposed on the B_2
input (if the signal does not propagate to the B_2 input, it does not cause any modification of h).
This type of transition does not change the B_2 output value, so the signal
propagation is blocked. Hence, in case (i) there is no propagation up to h.

Consider now case (ii); the last dynamic block will be called B_1 and the latch
which follows B_1 after an even number of inversions, B_2. When φ = "0", B_1 is pre-charged
at "1". The B_1 pre-charge arrives at the B_2 input with "1" and cannot propagate
through this latch; additionally, any transition which arrives at a B_1 input is stopped
since the block is pre-charging. In conclusion, h keeps its value both in case (i)
and in case (ii), so the theorem is true. ■

Theorem 5 (6): Consider an n-data (p-data) chain in evaluation phase (φ = "1" (φ = "0")).
If the composition rules are obeyed by all data chains, then only the following
transitions can appear:

"0"→"1" in the inputs of dynamic blocks,
"0"→"1" in the inputs of PL blocks, or
"1"→"0" in the inputs of PH blocks.

Proof of theorem 5 (the proof of theorem 6 is analogous): In order to prove this theorem,
we will proceed by mathematical induction with respect to the block order.

First, we should establish that the theorem is true for any block B_0 with zero
order. Let us start with an n-dynamic or a PL block. A B_0 input is connected to:

i. one circuit external input (r_1, except for pc-inputs);

ii. the output of one or more blocks B_{0j}, where B_{0j} is a:

a. p-latch, and there is any number of static inversions between B_0 and B_{0j} (r_1);

b. n-dynamic block, and there is an odd number of static inversions between B_0
and B_{0j} (r_{3a}, r_{3e}).

In case (i), the external input is stable when φ = "1" (hyp_2). Hence, no
transitions occur in the B_0 input for case (i). In case (ii), the B_0 input will be
modified only if a B_{0j} output is modified, and it propagates up to B_0. When B_{0j} is a
p-latch (ii.a), the modification is not possible since the B_{0j} output is stable while φ = "1"
(theorem 4). Alternatively, when B_{0j} is an n-dynamic block (ii.b), only "1"→"0" transitions
are possible at its output during the evaluation phase, and, except for "0"→"1" transitions, the B_0
input has no transitions. A similar reasoning can be applied to PH blocks. Thus, the
theorem is true for zero order blocks.
It remains to show that if the theorem is true for any block of order N or less, it is
also true for a block B_{N+1} with order equal to N+1.

Suppose that B_{N+1} is an n-dynamic or a PL block. A B_{N+1} input is connected to:

i. one circuit external input (r_1, except for pc-inputs);

ii. the output of one or more blocks B_{Nj}, where B_{Nj} is a:

a. p-latch, and there is any number of (static) inversions between B_{N+1} and B_{Nj} (r_1);

b. n-dynamic block, and there is an odd number of (static) inversions between B_{N+1}
and B_{Nj} (r_{3a}, r_{3e});

c. PL block, and there is an odd number of (static) inversions between B_{N+1} and B_{Nj}
(r_{3b}, r_{3e});

d. PH block, and there is an even number of (static) inversions between B_{N+1} and B_{Nj}
(r_{3c}, r_{3d}).

For case (i), the reasoning is the same as for the zero order blocks. In case (ii),
the B_{N+1} input will be modified only if a B_{Nj} output is modified, and it propagates up
to B_{N+1}. For (ii.a) and (ii.b), the arguments are also the same as for zero order blocks.
For (ii.c), B_{Nj} has order equal to or less than N, so the inputs of the block have only
"0"→"1" transitions; therefore, its output has only "1"→"0" transitions, and, due to
the odd number of inversions, the B_{N+1} input may have only "0"→"1" transitions. In
(ii.d), B_{Nj} has order equal to or less than N, so its inputs have only "1"→"0"
transitions, and again the inputs of B_{N+1} may have only "0"→"1" transitions.

A similar reasoning can be applied to PH blocks, so the theorem is true for any
block with N+1 order. Hence, by induction, the theorem is true for any block. ■

2.4 Exception rule 

Although the above described rules are necessary to avoid race problems, typical 
TSPC systems do not follow some of them. The most common exception is found in 
connecting two D-flip flops (Figure 3). In such configuration p-data chains (B^, or 
with only one p-latch (r, violation). In consequence, the p-latch output 
may change during the holding time. 




Normally the delay between nodes a, and 6, is long enough to ensure that bj is 
fully discharged through transistors N, and N^. In this case, the second D-flip flop 
works properly. 

We introduce a simple exception rule which covers the D-FF connection case.

Exception rule (rj: Configurations similar to the one of Figure 3, where rules and 
fg are not obeyed, are accepted. 



3 N-MOS LIKE LOGIC EXTENSION 

The given rules allow extremely complex logic designs. When high speed is also a 
requirement, restrictions on the use of p-dynamic and p-latch blocks should be 
imposed. These blocks have at least two p-transistors in series, which may 
considerably reduce the maximum speed. Normally in such applications, the p-data 
chains are limited to one block, and most logic operations are handled with n-data 
chains. Of course, when designs are done in this way, part of the strength of the
rules is lost, and deeper pipelines are necessary.

We can minimise this difficulty and also increase the n-data chain speed by means
of N-MOS like blocks. A similar technique was used in (Chang, 1996), but the authors
restricted the changes to D-flip flops.

The presence of two complementary logic blocks, one with p-transistors and the 
other with n-transistors, is one characteristic of CMOS logic; conversely, in N-MOS 
logic, complementary blocks do not appear, and the correct operation is guaranteed 
through the transistor sizing. New N-MOS like blocks are built from the dynamic 
blocks and latches, and their operation is based on transistor sizing too. Figure 4 
shows dynamic/latch blocks and their N-MOS like version. 




Figure 4 Conversion to N-MOS like blocks. 

The transistor dimensions of the modified blocks should follow the Table 1 
constraints. Those constraints specify which part of the circuit, n or p-transistors, 
must impose the output value when both n and p-logic blocks are conducting. In 
Figure 4 the dominant parts are drawn with bold lines. 

Since the modified blocks have a reduced number of transistors in series, they are 
faster but consume more power. In consequence, they should only be used in critical
data chains, where the desirable speed has not been reached. The exchange of 
blocks, N-MOS like for normal blocks, is easily performed since both types of 
blocks are subject to exactly the same rules (r^-r, or rj. The composition rules, 
and r^, the static blocks, the n/p-dynamic, the n/p-latch, the PH/PL data precharged, 
and the N-MOS like blocks compose the E-TSPC technique. 



Table 1 Conditions for correct operation in the N-MOS like blocks 



Circuit block         | clock=high                                                      | clock=low
--------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------
N-MOS like n-dynamic  | no constraint                                                   | output is high (independently of inputs)
N-MOS like p-dynamic  | output is low (independently of inputs)                         | no constraint
N-MOS like n-latch    | output=high if the p-trans. logic is conducting; otherwise, low | no constraint
N-MOS like p-latch    | no constraint                                                   | output=low if the n-trans. logic is conducting; otherwise, high



4 DESIGN RESULTS 

The full strength of the E-TSPC technique, mainly for high speed applications, can
only be evaluated through circuit examples. A high-speed dual-modulus prescaler
(divide-by-128/129) was designed using a standard 0.8 µm CMOS bulk process (the
effective length is 0.7 µm).

Figure 5 shows a schematic of the circuit. The cross-hatched part of the circuit, 
composed of three D-FFs and two logic gates, forms a divide-by-4/5 counter. The 
signal div32 selects if it counts up to 4 (div32=high) or up to 5 (div32=low). The 
five D-FFs at the bottom of the figure form a divide-by-32 counter. The fractional 
division ratio of the prescaler, 128 or 129, is selected according to Sm signal. 
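
The two division ratios follow from cascading the counters: one output period spans 32 cycles of the divide-by-4/5 counter, and counting up to 5 in a single one of those cycles turns 128 into 129. The following is a minimal behavioural sketch, assuming the 4/5 counter divides by 5 exactly once per output period when the ratio 129 is selected; the function name and the `select_129` flag are illustrative and are not taken from the schematic.

```python
def input_cycles_per_output(select_129: bool, low_stages: int = 32) -> int:
    """Count input clock cycles in one output period of the prescaler.

    Assumption (not read from the schematic): when the 129 ratio is
    selected, the 4/5 counter divides by 5 during exactly one of its
    `low_stages` cycles and by 4 during all the others.
    """
    total = 0
    for cycle in range(low_stages):
        # One extra input pulse is absorbed once per output period.
        divide_by_5 = select_129 and cycle == 0
        total += 5 if divide_by_5 else 4
    return total


if __name__ == "__main__":
    print(input_cycles_per_output(False))
    print(input_cycles_per_output(True))
```

With the default of 32 low-speed cycles the two calls give 32*4 and 31*4+5 input cycles respectively.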

The complete layout of the divide-by-4/5 counter, which composes the high speed 
critical part of the prescaler, was drawn, and SPICE netlists were extracted for four 
different approaches. These approaches are: 

Dc1: design with conventional rise edge-triggered TSPC D-FF (Figure 3);

Dc2: design with rise edge-triggered D-FF, and further optimisation applying the
E-TSPC technique;

Dc3: design with a modified fall edge-triggered D-FF (Chang, 1996);

Dc4: design with fall edge-triggered D-FF, and further optimisation applying the
E-TSPC technique (Figure 6).

Figure 5 Schematic of a dual modulus prescaler (divide-by-128/129).

Table 2 shows the maximum speed and the power consumption for each design. 
The results were obtained with SPICE simulations of the extracted netlists (slow 
parameters), power supply at 5V, and room temperature. 

Table 2 Maximum speed and power consumption results for the four designed 
divide-by-4/5 counters (SPICE simulations) 



Design | Speed (GHz) | Power (µW/MHz)
------ | ----------- | --------------
Dc1    | 0.98        | 3.27
Dc2    | 1.28        | 4.45
Dc3    | 1.39        | 4.85
Dc4    | 1.67        | 5.62



The speed improvement from the first design (Dc1) to the fourth (Dc4) is higher than
70%, and from the third (Dc3) to the fourth it is 20%. On the other hand, the power
consumption increases 72% from Dc1 to Dc4. The latter result is not surprising, since
the E-TSPC designs use N-MOS like blocks, and it confirms that these blocks should
only be placed in critical parts. Additionally, the areas of
the four designs are practically the same.
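
The quoted speed gains can be checked directly against the Table 2 speeds. The sketch below takes the four designs in the order in which they are listed; the helper name is illustrative.

```python
# Maximum speeds from Table 2, in GHz, in the order the designs are listed.
speeds_ghz = [0.98, 1.28, 1.39, 1.67]


def improvement_percent(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # First design versus fourth design.
    print(round(improvement_percent(speeds_ghz[0], speeds_ghz[3]), 1))
    # Third design versus fourth design.
    print(round(improvement_percent(speeds_ghz[2], speeds_ghz[3]), 1))
```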

The 4/5 counter layout was then completed to form the full prescaler. The D-FFs
of the divide-by-32 counter were built with conventional rise edge-triggered TSPC
D-FFs (Figure 3). In this case the clock signal from the divide-by-4/5 counter to
the divide-by-32 counter was inverted.

The results of the prescaler simulations, for slow parameters, are presented in 
Table 3. The recently published performance results of some prescalers which use 
TSPC flip-flops are also reported. In (Huang, 1996), the prescaler is implemented 
with rise-edge-triggered TSPC D-FFs and size optimised to reach maximum speed. 
In consequence, the achieved area and power consumption are very high.
Fall-edge-triggered TSPC D-FFs modified with N-MOS like blocks and with small sized
transistors are used in (Chang, 1996). The resulting circuit has small area and lower 
power consumption. Our implementation, with the E-TSPC technique and small 
sized transistors, resulted in the smallest area and the lowest power consumption; the 
speed, in addition, is comparable to (Huang, 1996). 




Figure 6 Transistor schematic of the fourth approach (Dc4). The transistor width/length,
in µm, or only the width is indicated in the figure (in this case the length is 0.8 µm).

It is clear from the previous examples that the application of the composition rules 
and the N-MOS like blocks (the E-TSPC technique) can significantly improve the 
circuit characteristics on area, speed, and power. 

Table 3 Area, speed, and power consumption results for three different prescalers 
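
Prescaler   | Technology (µm) | Area (10^-3 mm^2) | Speed (GHz) | Power (µW/MHz)
----------- | --------------- | ----------------- | ----------- | --------------
(Huang, 96) | 1.0             | 39.1              | 1.6         | 31.2
(Chang, 96) | 0.8             | 13.7              | 1.22        | 20.9
(this work) | 0.8             | 12.6              | 1.61        | 8.9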


5 CONCLUSIONS

The new Extended True Single-Phase-Clock (E-TSPC) CMOS circuit technique was 
presented and analysed. The E-TSPC technique represents an improvement in the 
following aspects: 

• the composition rules are only mildly restrictive, allowing the designer to take
full advantage of the blocks' potential;

• the N-MOS like blocks can increase the speed of the circuit wherever it is
necessary. Moreover, since the conventional and N-MOS like blocks follow the
same composition rules, they are interchangeable, and the cost of any substitution,
in terms of time, is very small;

• p-data chains can be designed to be as fast as n-data chains. In fact, p-data chains 
can use only N-MOS like p-latches, which have the logic implemented with n- 
transistors. 

The design results show that speed increases and power/area savings are obtained
by applying the E-TSPC technique. A complete high-speed dual-modulus
prescaler (divide-by-128/129) was designed in a 0.8 µm CMOS process, and
simulations of the circuits extracted from layout reached 1.61 GHz with a power
consumption of 8.9 µW/MHz.



Abstract

This paper describes a new I/O buffer design, planned to allow low ringing at high
frequencies in the presence of inductive package noise. The buffer is also
configurable, so that the semi-custom designer can trade off power
consumption, switching noise and delay. A reasonably simple and accurate model
of an I/O buffer can be used to help semi-custom users foresee PAD behavior
under different load and inductive packaging situations. Analysis results and
simulations of the behavioral model were validated by a discrete prototype. The
new buffer design was simulated, and reached higher frequencies without noise
degradation when compared with the traditional buffer design.



1 INTRODUCTION AND MOTIVATION

During the design of a new Gate-Array matrix some developments were made
regarding I/O buffers, and design trade-offs had to be made. Besides a DC current
capability ranging from 4 to 20 mA, a special requirement regarding programmable
power consumption had to be met, so that a designer could choose
between a faster PAD and a low power consumption one. However, the design should
not lose its efficiency regarding speed. I/O circuitry concentrates large transistors,
and hence large currents and output capacitances are to be expected. Also, the high
proportion of output pins among the total number of PADs is very important
when considering total power dissipation on the chip. The possibility of having
PADs with a programmable power-delay product was therefore very desirable.

As the design proceeded, however, some parasitic effects like pin inductance 
showed that output capacitance was not the most important parameter, when one 
wants to reach medium to high frequencies. Not only power was affected, but the 
digital integrity of the output signal was threatened as well, caused by inductive 
noise. In order to maintain digital signal integrity, a new buffer architecture was 
developed, allowing fast switching speed but extremely low switching noise. 

This work tries to summarize the design relations regarding the size of the output 
stage and a mathematical model able to help digital designers discover how to 
choose the correct compromise among PAD speed, PAD power, output load 
capacitance, parasitic inductance and digital signal integrity. To compensate 
ringing at power lines, a new circuit architecture is proposed and validated with 
simulations. 

This paper is organized as follows: section 2 presents a study about power 
consumption in PADs, as a function of the delay and the parasitic power 
consumption. In section 3 the ground bounce problem is explained, and studied as 
a function of the switching speed of the PAD, the output capacitance, output 
transistor conductance and the parasitic inductance present in the chosen 
packaging. In section 4 a model able to help users foresee the behavior of the PAD 
is presented, and validated with experimental data. Section 5 presents the new 
PAD designed to cope with inductive noise, followed by simulation experiments. 
Finally, section 6 presents our conclusions. 



2 POWER AND PARASITIC POWER IN OUTPUT BUFFERS 

Total power dissipated in a system consisting of a pair buffer/load is divided 
between dynamic power (the one delivered to the load) and parasitic power, which 
is the one caused by the short-circuit current present when a certain input voltage 
causes both the output buffer transistors to be on. This current flowing between 
Vdd and Vss bypasses the load, and the duration of such charge displacement is a 




Noise and power programmability in semi-custom I/O buffers 






function of the rise and fall time of the input, which is in turn dependent on the 
relative gate sizes between 2 consecutive buffer stages. 

Various studies are present in the literature ((Mead, 1980), (Veendrick, 1984),
(Prunty, 1992), (Choi, 1994)), although some of them do not use realistic output pad
capacitances (above 50pF, for example). In order to study different trade-offs
among power, delay and noise margins, various circuits were simulated with
realistic capacitive loads. Using CL=50pF, f=20MHz and Vdd=5V, dynamic
power reaches 25 mW, exactly 10 times the power used as an example in
(Veendrick, 1984).
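
The 25 mW figure follows from the usual expression for the power delivered to a switched capacitive load, P = CL * Vdd^2 * f. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def dynamic_power_w(c_load_f: float, v_dd: float, freq_hz: float) -> float:
    """Dynamic power of a capacitive load switched at freq_hz: C * V^2 * f."""
    return c_load_f * v_dd ** 2 * freq_hz


if __name__ == "__main__":
    # Values quoted in the text: CL = 50 pF, Vdd = 5 V, f = 20 MHz.
    p = dynamic_power_w(50e-12, 5.0, 20e6)
    print(f"{p * 1e3:.1f} mW")
```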

Buffers with 1, 2, 3 and 4 stages were simulated, measuring power
consumption, delay, and rise and fall times. All buffers were simulated at a 20MHz
clock frequency, with a load capacitance of 50pF. In each simulation the tapering
factor was changed, and as a consequence the number of stages also changed.
Table 1 shows the results of this study.



Table 1 - Buffers with 1,2,3 and 4 stages (simulation results) 





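                            | one stage | two stages | three stages | four stages
--------------------------- | --------- | ---------- | ------------ | -----------
tap. fact.                  | 55.0      | 7.40       | 3.82         | 2.72
Tot.Pow.(mW)                | 33.6      | 29.3       | 30.1         | 31.8
Dyn.Pow.(mW)                | 25        | 25         | 25           | 25
Par.Pow.(mW)                | 8.6       | 4.3        | 5.1          | 6.8
Delay(ns)                   | 7.25      | 3.49       | 3.26         | 3.39
Trise/fall(ns)              | 5.90      | 3.53       | 3.42         | 3.54
Area(transist.)(10^3 µm^2)  | 17.3      | 19.6       | 23.0         | 25.9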



Based on the results shown in Table 1, we decided to adopt a 2 stage buffer. The
conclusion of this first analysis is that, for circuits with realistic off-chip capacitive
loads and frequencies, parasitic power reduction is not essential. Reducing the
number of stages allows area savings: a 4 stage buffer with e as the
tapering factor has a total area of 25,904 µm^2 in the 1.2 µm CMOS technology used,
while a 2 stage buffer using 7.4 as the tapering factor reduces the area to 19,584 µm^2,
a 24% reduction in area, at the cost of an increase of only 0.1ns in the buffer delay.
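
The tapering factors in Table 1 are consistent with spreading one fixed overall size ratio evenly over the stages. The sketch below assumes that this overall ratio is the single-stage value of 55 and reads the two quoted areas as 25,904 and 19,584 µm^2; both readings are assumptions made for illustration, and the function names are hypothetical.

```python
def tapering_factor(overall_ratio: float, stages: int) -> float:
    """Per-stage size ratio when `overall_ratio` is split over `stages` stages."""
    return overall_ratio ** (1.0 / stages)


def area_reduction_percent(area_before: float, area_after: float) -> float:
    """Relative area saving, in percent."""
    return 100.0 * (area_before - area_after) / area_before


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(tapering_factor(55.0, n), 2))
    # Four-stage area versus two-stage area, in square micrometres.
    print(round(area_reduction_percent(25904.0, 19584.0), 1))
```

The computed factors stay within a few hundredths of the tabulated 7.40, 3.82 and 2.72, and the four-stage value is close to e.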



3 THE PROBLEM OF PIN INDUCTANCE (GROUND-BOUNCE)

The moment I/O circuits start to operate at higher frequencies, rise and fall times 
have to be shorter. Inevitably, the current variation over load capacitances in a 
given amount of time is also higher (di/dt). Output capacitances connected to an 
output PAD charge and discharge through VDD and VSS PADs. The di/dt caused 
by this charge movement will be translated to a voltage variation in the power 
supply pins of the circuit(VL=L*di/dt), resulting in a phenomenon known as 
ground-bounce (Johnson, 1993). 
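
As a rough order-of-magnitude illustration of VL = L*di/dt, the sketch below assumes the current ramps linearly over the transition. The example values (35 nH, 50 mA, 3.5 ns) are illustrative choices and not measured data; the function name is hypothetical.

```python
def ground_bounce_v(inductance_h: float, delta_i_a: float, delta_t_s: float) -> float:
    """Voltage across a pin inductance for a linear current ramp: L * di/dt."""
    return inductance_h * delta_i_a / delta_t_s


if __name__ == "__main__":
    # Illustrative values: 35 nH pin, 50 mA current step, 3.5 ns edge.
    print(ground_bounce_v(35e-9, 50e-3, 3.5e-9))
    # Same edge through a 7 nH pin.
    print(ground_bounce_v(7e-9, 50e-3, 3.5e-9))
```

Under these assumptions the lower-inductance pin reduces the induced voltage in direct proportion to L.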

According to the type of packaging used, different values of pin inductance will 
be present. For example, a 68 pin DIP presents 35nH inductance, while surface 
mount pins can have inductors as low as 7nH (Johnson, 1993). Practical 
measurements have shown inductance as high as 500nH for plastic pins packaging 
of the discrete transistors 4007, for example. 

Since the size of the output transistors has a determinant role in the ground- 
bounce problem, initially we have fixed their size according to the available output 
PAD area. The first stage of an output PAD was taken as a minimum size inverter 
from the Gate-Array matrix. Size of the output transistors is an important variable 
to define the tapering factor between stages and the overall area of the output 
buffer. Other variables are load capacitance, rise and fall time and the switching 
frequency of the buffer. 

To analyze noise problems originated by the ground-bounce we used a 2 stage
buffer, adding I/O pin and power line inductors, as shown in Figure 1 (L1 to L9).
We assumed one power pair (Vdd/Vss) for every 4 output pins driving
maximum current. This way, the discharge current (the one that flows from the
capacitors to ground) and the charge current (from Vdd to the output capacitor) are
distributed among multiple power pins. Also, power pins to the core and to the
PAD ring are assumed to be separated.

Initially, the circuit of Figure 1 was simulated with inductors L1 and L9 equal to
35nH, and load capacitors of 50pF. X1 is an inverter built with basic transistors
from the Gate-Array core, while X2 to X5 are two stage buffers as previously
described.

In Figure 2 one can see the waveforms of the output voltage over Cl. From top 
to bottom in this figure there are signals with frequencies of 5, 10 and 20 MHz 
respectively. Next, for the same set of frequencies, we halved
the size of the output transistors. It can be verified that in the first three
waveforms (the ones with larger transistors), there is a large oscillation
superimposed on the output signal, caused by the combination of reactive inductance
and capacitance at that pin. This superimposed oscillation is favored by the high Q of
the oscillating circuit, allowed by the small saturation resistance of the output
buffer transistors.

Figure 1 Simulated circuit

Figure 2 Voltage waveforms at Cl. 1) f=5MHz; 2) f=10MHz; 3) f=20MHz;
4,5,6) same frequencies, but half transistor sizes



Still referring to Figure 2, in waveforms 4 to 6 a meaningful reduction in the 
superposed oscillation has been obtained. This reduction is achieved thanks to the 
reduction in the Q factor by the increase of output resistance of the output 
transistors. Also, smaller output transistors allow smaller output currents at the 
power bus connections inside the chip. For large output transistors simulations 
showed a peak current of almost 50mA, a value too large when one has to think 
about many output PADs switching at the same time. For small output transistors
only 130mA were required for a group of 4 PADs, against 200mA in the case of
large conductance transistors. At a frequency of 20MHz the output signal has a
large amount of distortion, as shown in Figure 2. In Figure 3 one can see the output
waveforms at 20MHz, but with smaller parasitic inductors (7nH). The first and
second waves refer to a 50pF load, while the others were simulated with a 100pF
load. Waves one and three (top down) have the larger output transistors, while the
others have transistors of half that size. In this case of low inductive packaging,
buffers with half the maximum size can be used to charge loads up to 50pF, while
for 100pF or more the fastest transistor should be used, to avoid excessive output
signal attenuation.







Figure 3 Output waveforms for a 20 MHz input, 7nH parasitic inductor. 1) Cl=50pF;
2) Cl=50pF, half-size transistors; 3) Cl=100pF; 4) Cl=100pF, half-size transistors

The analysis of simulation results allows the formulation of a design strategy
for the semi-custom user engineer. In order to diminish the amount of
ground-bounce, one possible solution is to divide the output transistor into 2 smaller
ones. By metallization the designer can choose between a faster PAD for highly
capacitive loads and a slower PAD for small capacitances with less danger of
signal ringing. Under the same load conditions, large transistors allow faster rise and
fall times, at the price of a higher ground-bounce when high inductance packaging
is used. In this last case, a series termination must be added to cancel out the
parasitic oscillations. Using the smaller transistor pair degrades the
nominal performance (delay, rise and fall time), but gives a safer circuit regarding
oscillations.



4 BEHAVIORAL MODEL

Since physical parameters like parasitic inductance and load capacitance can not be 
changed, the final decision of which PAD circuit to use (more resistive and slower, 
more conductive and more oscillation prone) must be taken by the final Gate- 
Array user, or by a CAD tool. In both cases a behavioral model is necessary, 
taking all design parameters into account and giving reliable results regarding the 
final circuit behavior under the desired load. 

A very simple second order model has been developed, shown in Figure 4. As in 
Figure 1, GND inductors present at the GND pin are also represented. The output 
transistor was substituted by a voltage source and series resistor, representing the 
output high to low transition and the output transistor resistance when in 
saturation. Only the high-low transition was taken into account, for the low-high 
one should be similar by assuming proper transistor sizing to assure equal noise 
margins. 

The model shown in Figure 4 has the advantages of simplicity and ease of
use, and can be readily manipulated by the final user or the CAD system. The
substitution of the transistor by a voltage source and a resistor might cause
amplitude errors, because the transistor will move from the saturation to the linear
region, increasing the time it takes to discharge the capacitance. These errors,
however, should be small, given the large amplitude of the oscillations at the
beginning of the transition, when the transistor is saturated.




Figure 4 Second order model representing a PAD 

To develop the model we first verified that there are only two energy
storing elements. Since it is much easier to work in the frequency domain
to solve the differential equations of the circuit, we computed the
response in the s-plane and applied the inverse transform. The only approximation
was to consider a linear resistor, which is not critical, as already mentioned.

The output voltage over the load can be expressed in the s-plane as the
response of the transfer function to a voltage step, and one writes

$$V_{out}(s) = \frac{A\,\omega}{(s+\alpha)^2 + \omega^2} \qquad (1)$$

where A is an amplitude constant. This equation has a time domain equivalent given by

$$V_{out}(t) = A\,e^{-\alpha t}\,\sin(\omega t). \qquad (2)$$

Equation (2) means that the response of the circuit is a sinusoidal function,
exponentially attenuated. This result is easily verified in Figures 2 and 3. The term
ω in equations (1) and (2) is the oscillation radian frequency, given by the parasitic
inductor and the load, 1/SQRT(L*C). The term α is the attenuation factor, set by the
ratio between R and L (R/(2L) for a series RLC circuit). Again, this is coherent with
Figures 2 and 3, where the increase in the output
resistance or the decrease in packaging inductance greatly diminished oscillations.
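
A minimal numerical sketch of the exponentially attenuated sinusoid, assuming the circuit reduces to a series RLC loop so that the textbook damping rate R/(2L) applies. The amplitude argument, the function names and the example resistance are illustrative assumptions and not values from the paper.

```python
import math


def ringing_parameters(l_h: float, c_f: float, r_ohm: float):
    """Return (omega0, alpha) for a series RLC loop.

    omega0 = 1/sqrt(L*C) is the oscillation radian frequency used in the text;
    alpha = R/(2L) is the textbook series-RLC damping rate (an assumption here).
    """
    omega0 = 1.0 / math.sqrt(l_h * c_f)
    alpha = r_ohm / (2.0 * l_h)
    return omega0, alpha


def damped_sine(t_s: float, amplitude: float, alpha: float, omega: float) -> float:
    """Exponentially attenuated sinusoid: A * exp(-alpha*t) * sin(omega*t)."""
    return amplitude * math.exp(-alpha * t_s) * math.sin(omega * t_s)


if __name__ == "__main__":
    # 35 nH pin inductance and 50 pF load, with an illustrative 10 ohm resistance.
    omega0, alpha = ringing_parameters(35e-9, 50e-12, 10.0)
    print(f"ringing frequency: {omega0 / (2 * math.pi) / 1e6:.1f} MHz")
    print(f"envelope time constant: {1e9 / alpha:.1f} ns")
    print(damped_sine(2e-9, 1.0, alpha, omega0))
```

Doubling the resistance doubles the damping rate in this sketch, which matches the observation that more resistive output transistors ring less.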

To validate the model presented in Figure 4, a circuit using discrete 4007 devices
was used. In order to carry out the analysis in a meaningful way, we first
applied a frequency scaling, so that all inductors and capacitors had their values
multiplied by 1000, while the frequency was reduced by the same factor. This way,
parasitic inductors and capacitors present in the 4007 packaging are not
significant for the desired measurement, in comparison with the relatively large
inductor placed outside the chip. The actual conductance of output devices of a
given technology is not important when one wants to compute the oscillation
frequency. What really matters is the relative size of the series impedances over
frequency, so that the exponential envelope attenuates the oscillations.
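
The frequency scaling can be checked with a short sketch: multiplying both L and C by the same factor k divides the resonant frequency by k and leaves the damping ratio unchanged for the same resistance. A series RLC loop is assumed, and the names and the example resistance are illustrative.

```python
import math


def resonant_frequency_hz(l_h: float, c_f: float) -> float:
    """Undamped resonant frequency of an LC pair."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_h * c_f))


def damping_ratio(r_ohm: float, l_h: float, c_f: float) -> float:
    """Series RLC damping ratio: (R/2) * sqrt(C/L)."""
    return 0.5 * r_ohm * math.sqrt(c_f / l_h)


if __name__ == "__main__":
    k = 1000.0
    l_h, c_f, r_ohm = 35e-9, 50e-12, 10.0  # illustrative on-chip values
    print(resonant_frequency_hz(l_h, c_f))
    print(resonant_frequency_hz(k * l_h, k * c_f))
    print(damping_ratio(r_ohm, l_h, c_f), damping_ratio(r_ohm, k * l_h, k * c_f))
```

Because the damping ratio is preserved, the shape of the ringing is the same in the scaled prototype and only the time axis is stretched.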







Figure 5 Comparison of theoretical and experimental data 

Experimental results, shown in Figure 5, are in good agreement with the
proposed model. The difference between the proposed equation and the measurements
was below 9% and appeared only when the amplitude of the oscillations was small; it
is caused by the change in transistor conductance when moving from saturation to the
linear region.

Another solution, such as a strobe signal to precharge the output to Vdd/2 before the
switching occurs, could be used (Kang, 1996). In this case, the amount of current
needed is smaller, because the output is already at half the final value. Nevertheless,
this type of solution has 2 drawbacks. First, it needs an internally generated strobe
signal, which is highly timing dependent and tricky for the system designer to cope
with. Second, total power would increase, because maintaining the output at half Vdd
requires a path from the power line to GND.



5 NEW PAD DESIGN 

Although the semi-custom engineer can choose between a faster and a more
conservative PAD, the ideal situation would be the availability of a fast buffer
without inductive noise. This goal is under research in different laboratories,
and some results have been achieved, mainly regarding transmission line
terminations ((Sechler, 1995), (Johnson, 1993)). To the authors' knowledge,
no circuit is available that copes with the inductive noise present in low cost
packaging in medium to high speed designs.

In order to solve the problem we considered the reasoning behind the proposed 
model. Since a resistive circuit greatly attenuates the noise, but slows down circuit 
performance, what one needs is a fast transistor that will allow fast transitions, and 
gradually this fast transistor should turn into a large resistor, increasing the 
attenuation factor and avoiding parasitic oscillations. 

One way to achieve this goal would be to use a self-timed buffer that could
turn itself off after the output signal had reached the desired voltage level. Instead
of using self-timed logic, we relied on the analog nature of transistors. Figure 6
shows the new proposed architecture (the n-part of the circuit; the p-part is just its
dual). The main idea is the following: the new PAD should be fast enough to drive
the output voltage to Vt (threshold voltage) of the next stage, so as to start a
new transition. From this point on, the PAD should immediately increase its
resistance, so that all oscillations disappear.

The new PAD was developed following the proposed philosophy. The output
transistor was split into two (MNinv and M1), with a ratio of 4:1, the larger
transistor being M1. As shown in Figure 6, when the input signal Vin
is at logic 1, the output should be discharged. At the beginning of the transition,
M3 is conducting, while M2 is off. Both M1 and MNinv are on, so the PAD is
working with maximum conductance. As Vout goes down, M3 starts to cut off, at
the same time that M2 turns on. Since M2 removes charge from the gate of
M1, M1 increases its resistance, avoiding ringing.



Figure 6 New PAD architecture 

After the transition phase M3 will be completely off, and M2 on. Since Vds of
M2 is equal to Vgs of M1, and Vds of M2 can not be smaller than Vtp, M1 will
always be at a small conduction value, but in the linear region, not in saturation,
with a larger resistance. When a low to high transition at the output node is needed,
it should be noticed that although M1 will be conducting until Vout reaches Vtp,
the amount of power lost to parasitic currents is the same as in normal
output buffers, when both p and n transistors are on.

Figure 7 presents simulation results of the proposed circuit, in a 1.2 µm CMOS
technology, with parasitic inductors of 35nH and load capacitor of 100pF, at 33
MHz, considering a short transmission line connecting 2 chips. The output signal is
taken after a hypothetical input inverter of the second chip. It should be noticed
that the normal I/O buffer does not work any more (signal outln), because the next
stage has some spurious transitions. On the other hand, the compensated buffer (signal
outl) works fairly well, allowing the system to reach such frequencies. At low to
medium frequencies, the ringing problem is not so evident (di/dt is small), and the
compensated buffer causes a delay of 1 to 2 nanoseconds.



6 CONCLUSIONS 

The design of a low power PAD as proposed in (Veendrick, 1984) could not be 
achieved, since parasitic short-circuit current power savings are not meaningful 
when realistic output loads and frequencies are used, like 50pF and 10 to 20MHz. 

The design of an output buffer must take into account trade-offs between power 
dissipation and delay, besides rise and fall times. The choice for a faster PAD 
might not be the right one, because this can cause severe output oscillations when 
small loads are used. In this case, either the user slows down the system (in order
to wait for it to stabilize) or a series termination must be used to increase output
resistance so as to attenuate the undesired oscillations.

Figure 7 Simulation results of the new PAD: L=35nH, C=75pF

The trade-off between noise margins and speed is the responsibility of the Gate-Array
user, or system designer, taking into account the output load capacitance and
the packaging inductance. In order to help the designer, or to provide data to the
supporting CAD tool, a simple second order model was proposed and validated, and it
proved quite effective in predicting circuit behavior. When a fast PAD is chosen to
drive small loads, severe oscillations might be present, and must be attenuated by a
series resistor termination, which will affect rise and fall times.

Alternatively, we proposed a new PAD architecture, with varying resistance, able 
to cope with fast transitions and allowing small inductive noise. The proposed 
PAD has reached higher frequencies than the traditional one, thanks to the 
elimination of ground bounce by the use of a variable conductance transistor. 



ABSTRACT

This work presents a pre-charge type DPLL that can be used as a clock recovery
circuit in SDH/SONET communication protocols at 622 MHz. The
phase/frequency detector, which operates equally well at low and high frequencies,
is presented. Each block of the DPLL is described, calculated and simulated. The
stability criteria and analysis of the DPLL and its final layout are also presented.
The main feature of this work is the quick synchronization time of the DPLL due to
the inclusion of an initialization circuit.

Keywords: PLL, DPLL, PFD, VCO, Synchronization.



1. INTRODUCTION 

In synchronous communication systems, the receiver should be able to recover the 
clock signal from the incoming message. This is required in order to sample and 
process the incoming data at the proper time. Therefore, the receiver should
have a faithful copy of the clock signal used by the transmitter. The circuit used to
recover the clock signal is the clock recovery circuit.

The most widely used circuit to recover the clock signal is the Phase Locked Loop -
PLL [Best, 1993; Wolaver, 1991]. The PLL compares the phase and/or frequency
between the reference signal and a signal generated by the PLL circuit itself. If the 
reference signal and the self generated signal have the same phase/frequency, they 
are synchronized, and in this case, the signal generated by the PLL is used as a 
synchronization signal in the receiver. 

This work presents an implemented Digital Phase Locked Loop - DPLL which
synchronizes quickly and is implemented with a reduced number of transistors.
This DPLL may be used in Synchronous Digital Hierarchy - SDH and Synchronous
Optical Network - SONET network protocols operating at 622MHz.



2. CHARGE PUMP DPLL 

The DPLL [Paemel,1994] is composed basically of four blocks, namely the Phase 
Frequency Detector - PFD, the charge pump circuit, the low-pass loop filter and the 
Voltage Controlled Oscillator - VCO, as shown in Figure 1. 




The PFD is a sequential phase/frequency detector [Gardner, 1979]. The PFD has 
the following valid output states: 

• Up = 0 and Down = 0 indicates the high impedance state, where $F_{ref} = F_{vco}$. 

• Up = 0 and Down = 1 indicates that a Down pulse is generated, where 
$F_{ref} < F_{vco}$ or $\theta_{ref}$ is leading $\theta_{vco}$. 

• Up = 1 and Down = 0 indicates that an Up pulse is generated, where 
$F_{ref} > F_{vco}$ or $\theta_{ref}$ is lagging $\theta_{vco}$. 
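
The three valid output states can be summarized as a small lookup. This is only a behavioural sketch: the function name `pfd_state` is hypothetical, and the mapping of Up to current injection and Down to current drain is an assumption based on the charge pump description, not a statement taken from the circuit itself.

```python
def pfd_state(up: int, down: int) -> str:
    """Map the (Up, Down) outputs of the PFD to the assumed charge pump action."""
    states = {
        (0, 0): "high impedance: no correction",
        (0, 1): "Down pulse: charge pump drains current from the loop filter",
        (1, 0): "Up pulse: charge pump injects current into the loop filter",
    }
    if (up, down) not in states:
        # Up = 1 and Down = 1 is not listed among the valid output states
        raise ValueError("invalid PFD output state")
    return states[(up, down)]


if __name__ == "__main__":
    for outputs in ((0, 0), (0, 1), (1, 0)):
        print(outputs, "->", pfd_state(*outputs))
```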

The PFD needs a charge pump circuit, as indicated in Figure 1. The charge pump 
is basically an electronic switch that injects (drains) a current $I_p$ into (from) the 
loop filter, depending on the pulse generated by the PFD. The charge pump 
conducting interval $t_{on}$ is given by: 



where $\theta_e$ represents the phase/frequency error between the reference and the 
VCO signals, and $\omega_{in}$ represents the angular frequency of the incoming signal. 

The current injected (drained) by the charge pump circuit, in each cycle, is given 
by: 

(2) 



The gain of the charge pump PFD is given by: 




(3) 
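
As a numerical illustration of the three charge pump quantities, the sketch below assumes the standard charge-pump PLL relations from the literature cited here [Gardner, 1980]: conducting interval |θe|/ω_in, average current per cycle Ip·θe/(2π), and detector gain Ip/(2π). These forms are an assumption, and the function name `charge_pump_quantities` is hypothetical.

```python
import math


def charge_pump_quantities(theta_e, omega_in, i_p):
    """Return (t_on, i_avg, k_d) under the assumed standard charge pump relations."""
    t_on = abs(theta_e) / omega_in          # conducting interval per cycle [s]
    i_avg = i_p * theta_e / (2 * math.pi)   # average current over one cycle [A]
    k_d = i_p / (2 * math.pi)               # PFD/charge pump gain [A/rad]
    return t_on, i_avg, k_d


if __name__ == "__main__":
    # Quarter-cycle phase error at 622 MHz with the 500 uA pump current
    t_on, i_avg, k_d = charge_pump_quantities(math.pi / 2, 2 * math.pi * 622e6, 500e-6)
    print(f"t_on = {t_on:.3e} s, i_avg = {i_avg:.3e} A, K_d = {k_d:.3e} A/rad")
```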



The VCO must produce a signal whose output frequency is a function of the 
voltage produced by the loop filter, and it should have the same phase and 
frequency as the reference signal. 

Since the angular frequency of a signal is the time derivative of its phase, 
the VCO controlling equation is: 

$$\frac{d\theta_o(t)}{dt} = \omega_0 + K_o V_c(t) \qquad (4)$$

where $\theta_o(t)$ represents the phase of the VCO signal, $\omega_0$ represents the VCO 
angular frequency, $V_c(t)$ is the controlling voltage, and $K_o$ represents the gain 
of the VCO, given in [rad/(V.s)]. 

Therefore, from equation (4), the VCO frequency shift caused by the system 
controlling voltage is given by (in the frequency domain): 

$$s\,\theta_o(s) = K_o V_c(s) \qquad (5)$$

The signal produced by the PFD is composed of a DC level, proportional to the 
phase/frequency error of the input signals, and of high frequency components. Since 
this signal is used as the VCO controlling voltage, the high frequency components 
must be eliminated using a low pass loop filter [Gardner, 1980; Encinas,1993; 
Keese,1996]. The loop filter is also responsible for eliminating any ripple in the 
DPLL, which could take the VCO to saturation, forcing the DPLL to lose 
synchronism. 

Figure 2 shows the DPLL block diagram. From the figure, the closed and open 
loop transfer functions are, respectively: 

Figure 2 Block diagram of the DPLL in the frequency domain. 



3. PHASE/FREQUENCY DETECTOR 

The implemented PFD operates equally well at low and high frequencies 
[Kodon,1995; Yuan, 1989], which is not the case for the conventional PFD 
[Sayver,1990]. Figure 3 shows the circuit of the PFD, which is composed of two 
blocks responsible for the generation of the Up and Down signals. Each block is 
composed of two pre-load branches and a transistor for the RESET of the block. 
Each block also has an output inverter and a transistor responsible for the PRESET. 




Figure 3 PFD circuit. 

The principle of operation of the PFD can be observed by considering the generation 
of an Up signal, as indicated in Figure 4. When the reference signal is low, the 
pre-charge transistor Mu1 conducts, and consequently sets node (a) to high. Node (b) 
is also high due to the application of the PRESET signal (system initialization). 
When the reference signal makes a low-to-high transition, the transistor Mu1 turns 
off, which puts node (a) in high impedance. This turns on the transistors Mu6 and 
Mu7, forcing node (b) to low. 

At the low-to-high transition of the VCO signal, the transistors Mu2 and Mu3 
conduct and node (a) goes to low, and consequently the pre-charge transistor Mu5 
conducts. This situation forces node (b) to high. The signal Up is the inverse of the 
signal at node (b). It can be observed from Figure 4 that the Up signal 
corresponds to the phase difference between the VCO and the reference signals. 
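
The relation between edge timing and Up pulse width can be illustrated with a purely behavioural sketch. It assumes an idealized detector in which Up is asserted at each rising reference edge and released at the next rising VCO edge; transistor-level delays are ignored, and the function name `up_pulse_widths` and the edge times are hypothetical.

```python
def up_pulse_widths(ref_edges, vco_edges):
    """Width of each Up pulse: from a rising reference edge to the next rising VCO edge.

    Both arguments are sorted lists of rising-edge times (any consistent unit).
    """
    widths = []
    j = 0
    for t_ref in ref_edges:
        # Skip VCO edges that occurred before this reference edge
        while j < len(vco_edges) and vco_edges[j] < t_ref:
            j += 1
        if j == len(vco_edges):
            break  # no later VCO edge is available to end the pulse
        widths.append(vco_edges[j] - t_ref)
    return widths


if __name__ == "__main__":
    # Edge times in picoseconds: the VCO lags the reference by 200 ps in every cycle
    print(up_pulse_widths([0, 1608, 3216], [200, 1808, 3416]))
```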

Figure 4 Generation of Up pulse. 

Figure 5 shows the HSPICE simulation of this circuit for input frequencies of 
100 MHz, 300 MHz, 625 MHz and 900 MHz. 




Figure 5 Generation of Up pulses by the PFD. 

The charge pump circuit used, shown in Figure 6, is a current switch controlled by 
the PFD. 




Figure 6 Charge pump circuit. 

The charge pump has an initialization circuit connected at its output. This circuit 
places 2.5 V at the output of the charge pump during the initialization of the DPLL 
system. Since this signal is applied to the input of the VCO, the VCO starts 
oscillating at its central frequency, which drastically reduces the synchronization 
time of the DPLL. 

The charge pump sources have been implemented using current mirrors adjusted 
to 500 µA. This current value has been chosen to charge the internal capacitances 
quickly and to increase the gain of the DPLL circuit [Keese,1995]. Therefore, 
the synchronization time is also optimized. 



4. VOLTAGE CONTROLLED OSCILLATOR - VCO 

The voltage controlled oscillator is divided into two blocks, namely the reference 
circuit and the three-stage ring oscillator. The VCO reference circuit is shown in 
Figure 7. 




Figure 7 VCO reference circuit 

The VCO reference circuit is used to convert the signal from the loop filter into two 
new signals to be used by the ring oscillator. Considering, initially, that Vc(t) is low, 
then the pMOS input branch (transistors Mr1 and Mr2) conducts a current of 
600 µA. Therefore, the current mirror EC1 starts to conduct, draining all the current 
(600 µA) from the source formed by the transistors Mr5, Mr6 and Mr7. The output 
voltage pbias goes to high and nbias goes to low. This situation represents the VCO 
at its lowest frequency. 

In the inverse condition, where the VCO is at the highest frequency, the 
controlling voltage Vc(t) is high. This forces the nMOS input branch to conduct, and 
there will be a maximum current at its output. Therefore, the output voltage pbias 
goes to low and nbias goes to high. 

Figure 8 shows the controlling voltages of the VCO reference circuit obtained 
from the HSPICE simulations. 

Figure 8 Control voltage of the reference circuit. 

The VCO circuit shown in Figure 9 is composed of three stages of current mirrors. 
The circuit also has a differential amplifier at its output to square the signal 
generated by the oscillator. 

The current mirrors EC1, EC3 and EC5 have a 1:1 current transfer relation, while the 
current mirrors EC2, EC4 and EC6 have a 1:2 current transfer relation. The current 
relation of the mirrors EC2, EC4 and EC6 has been chosen in such a way that, during 
operation, they would drain half of the current generated by the transistors Mv4, Mv9 
and Mv14, respectively, but would drain the maximum current from the mirrors EC1, 
EC3 and EC5. 




Figure 9 VCO circuit. 

When the transistor Mv1 is conducting (low level at its gate), the current mirror EC1 
conducts, thus draining all the current generated by the transistor Mv4. Therefore, 
the current mirrors EC2 and EC3 are off. Since the current mirror EC3 is off, the 
current in the transistor Mv10 (which is part of the current mirror EC4) is half of the 
current produced by the transistor Mv9. This current is mirrored by EC4 (relation 
1:2) and turns on the current mirror EC5. The current produced by Mv14 is drained 
by the current mirror EC5, and the current mirror EC6 cuts off. Therefore, the signal 
at the gate of the transistor Mv1 goes to high, and the oscillation starts. 

The gain of the VCO, given by $K_o = \Delta f/\Delta V$, can be obtained using the VCO 
characteristic curve shown in Figure 10. 



Figure 10 VCO characteristic curve (frequency axis in MHz). 



5. LOOP FILTER 

Figure 11 shows the passive second order loop filter designed. 



Figure 11 DPLL second order loop filter (schematic labels: vc(t), R, C, GND). 

A second order loop filter has been used to reduce the ripple that would occur 
if a first order loop filter had been used [Gardner, 1980; Meyr,1990]. 
The impedance of the second order loop filter is given by: 

$$Z_F(s) = \left(\frac{b-1}{b}\right) \frac{s\tau + 1}{sC\left(\frac{s\tau}{b} + 1\right)} \qquad (8)$$

where $b = (C/C_1) + 1$ and $\tau = RC$. 
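
The closed form of the filter impedance can be checked numerically. The sketch below assumes the usual second order passive topology, a series R-C branch shunted by a capacitor C1, which is consistent with the definitions b = (C/C1) + 1 and τ = RC; the function names and the component values in the example are illustrative assumptions.

```python
def z_direct(s, r, c, c1):
    """Impedance of a series R-C branch in parallel with C1 (assumed topology)."""
    z_rc = r + 1 / (s * c)
    z_c1 = 1 / (s * c1)
    return z_rc * z_c1 / (z_rc + z_c1)


def z_closed_form(s, r, c, c1):
    """Same impedance written with b = C/C1 + 1 and tau = R*C."""
    b = c / c1 + 1
    tau = r * c
    return (b - 1) / b * (s * tau + 1) / (s * c * (s * tau / b + 1))


if __name__ == "__main__":
    # Illustrative values: R = 3.4 kOhm, C = 5 pF, C1 = 0.5 pF (so b = 11)
    s = 1j * 1.95e8
    print(z_direct(s, 3.4e3, 5e-12, 0.5e-12))
    print(z_closed_form(s, 3.4e3, 5e-12, 0.5e-12))
```

Both functions should return the same complex value, since the second is only an algebraic rearrangement of the first.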

The open loop transfer function of the charge pump DPLL using a second order 
loop filter is: 

A method for designing the DPLL loop filter is to adopt values of cut frequency 
and phase margin, and then the values of the filter elements can be determined 
[Keese,1996]. 

Using Equation (9), the value of $\tau$ can be determined as a function of $\omega$. 

The system phase margin, PM, is: 

The cut frequency can be obtained by taking the derivative of the phase margin 
with respect to $\omega$ and setting it equal to zero: 

(12) 




H. Meyr and G. Ascheid [Meyr,1990] also derived an expression for the DPLL cut 
frequency as a function of the VCO gain $K_o$, the charge pump output current 
$I_p$, and the value of the DPLL filter resistance R, given as: 

$$\omega_c = K_o \frac{I_p}{2\pi} R \left(\frac{b-1}{b}\right) \qquad (13)$$



The phase margin of a DPLL system having a second order filter can be adjusted 
by the proper selection of the value of b. Table 1, obtained using MATLAB, 
presents the variation of the phase margin as a function of b. 

Table 1 - Variation of Phase Margin as Function of b. 

b              6    8    10   12   14   16   18      20 
PM (degrees)   45   51   54   57   60   62   63.47   64.82 
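
The trend of Table 1 can be approximated with a few lines of code. The sketch assumes that the loop is designed at the frequency of maximum phase margin, where ω_c·τ = √b, so that PM = arctan(√b) − arctan(1/√b); this closed form is an assumption and is not quoted from the text above. The function name `max_phase_margin_deg` is hypothetical.

```python
import math


def max_phase_margin_deg(b):
    """Maximum phase margin (degrees) of the second order filter loop for a given b."""
    root_b = math.sqrt(b)
    return math.degrees(math.atan(root_b) - math.atan(1 / root_b))


if __name__ == "__main__":
    for b in (6, 8, 10, 12, 14, 16, 18, 20):
        print(f"b = {b:2d}  PM = {max_phase_margin_deg(b):.2f} degrees")
```

Under this assumption the computed values follow the table closely, with small differences that are compatible with rounding in the tabulated entries.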



Therefore, the design of a loop filter may begin with an adequate value of phase 
margin, and then Equations (12) and (13) can be used to obtain the values of the 
filter components. 
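
A minimal sketch of this design procedure is given below. It assumes that the cut frequency condition takes the form ω_c = √b/τ (the condition of maximum phase margin) and that the cut frequency expression of Meyr and Ascheid is ω_c = K_o·(I_p/2π)·R·(b−1)/b; both forms, the function name `design_loop_filter` and the example values are assumptions for illustration.

```python
import math


def design_loop_filter(k_o, i_p, b, omega_c):
    """Return (R, C, C1) for a chosen b and cut frequency omega_c.

    k_o: VCO gain [rad/(V*s)], i_p: charge pump current [A], omega_c: [rad/s].
    """
    r = omega_c * 2 * math.pi * b / (k_o * i_p * (b - 1))  # from the cut frequency expression
    tau = math.sqrt(b) / omega_c                           # assumed maximum phase margin condition
    c = tau / r                                            # tau = R * C
    c1 = c / (b - 1)                                       # b = C / C1 + 1
    return r, c, c1


if __name__ == "__main__":
    # Example: K_o of 127.8 MHz/V (assumed unit), I_p = 500 uA, b = 11
    r, c, c1 = design_loop_filter(2 * math.pi * 127.8e6, 500e-6, 11, 1.9751e8)
    print(f"R = {r:.1f} Ohm, C = {c:.3e} F, C1 = {c1:.3e} F")
```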



6. LAYOUT AND SIMULATIONS 



The final layout of the DPLL circuit, implemented in ES2 0.7 µm CMOS 
technology, is given in Figure 12. The final simulations, using HSPICE and 
considering the parasitic capacitances, are shown in Figure 13. 




Figure 12 DPLL final Layout 

The DPLL open loop transfer function is given by: 

(14) 

where $K_o$ = 127.8 [MHz/V], $I_p$ = 500 [µA], b = 11, $\tau = RC = 1.7 \times 10^{-8}$ [s] 
and R = 3.4 [kΩ]. 

Then, the DPLL open loop transfer function (Equation 14) becomes: 

$$G_o(s) = \frac{3.357\,s + 1.9751 \times 10^{8}}{2.6273 \times 10^{-17}\,s^3 + 1.70 \times 10^{-8}\,s^2} \qquad (15)$$
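
The numerical coefficients of Equation (15) can be recomputed from the design values. The sketch assumes an open loop function of the form G_o(s) = K_o·(I_p/2π)·Z_F(s)/s, with numerator and denominator both multiplied by R, and it assumes that the VCO gain is 127.8 MHz/V; the function name `open_loop_coefficients` is hypothetical.

```python
import math


def open_loop_coefficients(k_o, i_p, b, tau, r):
    """Polynomial coefficients of the open loop transfer function.

    Returns ((n1, n0), (d3, d2)) for (n1*s + n0) / (d3*s**3 + d2*s**2).
    """
    gain = k_o * i_p / (2 * math.pi) * r * (b - 1) / b
    numerator = (gain * tau, gain)        # coefficients of s^1 and s^0
    denominator = (tau * tau / b, tau)    # coefficients of s^3 and s^2
    return numerator, denominator


if __name__ == "__main__":
    k_o = 2 * math.pi * 127.8e6  # rad/(V*s), assuming 127.8 MHz/V
    num, den = open_loop_coefficients(k_o, 500e-6, 11, 1.7e-8, 3.4e3)
    print("numerator:", num)
    print("denominator:", den)
```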







Charge pump DPLL to operate at high frequencies 






Figure 14 shows the Bode diagram of this transfer function. 




Figure 13 Simulations of control voltage and synchronized signals. 




Figure 14 Bode diagram of the DPLL designed. 

For the system the phase margin (PM) is: 

$$PM = \arg[G_o(j\omega_c)] + 180^\circ = 56.44^\circ$$

where $\omega_c$ is the DPLL cut frequency. 

Since the phase margin is positive, the DPLL is stable, according to the 
Nyquist criterion. 
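
The phase margin can also be obtained numerically from the open loop transfer function. The sketch below uses the coefficients 3.357, 1.9751e8, 2.6273e-17 and 1.70e-8, finds the unity-gain frequency by bisection on a logarithmic frequency scale, and evaluates the phase there. The function names and the search interval are assumptions for illustration.

```python
import cmath
import math


def g_open(s):
    """Open loop transfer function with the numerical coefficients of the design."""
    return (3.357 * s + 1.9751e8) / (2.6273e-17 * s**3 + 1.70e-8 * s**2)


def crossover_frequency(lo=1e6, hi=1e10):
    """Angular frequency where |G_o| = 1; the magnitude falls monotonically with frequency."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the interval
        if abs(g_open(1j * mid)) > 1:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def phase_margin_deg():
    w_c = crossover_frequency()
    return math.degrees(cmath.phase(g_open(1j * w_c))) + 180


if __name__ == "__main__":
    print(f"w_c = {crossover_frequency():.4e} rad/s, PM = {phase_margin_deg():.2f} degrees")
```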


7. CONCLUSIONS 

This work presented the design of a DPLL that can be used as a clock recovery 
circuit in systems operating at 622 MHz under the SONET/SDH protocols. The 
design of each block of the DPLL has been presented. 

The main feature of the DPLL is its short synchronization time, which was 
made possible by the use of an initialization circuit in the charge pump, a high 
gain loop and a loop filter with a short time constant. 

Abstract 

This paper presents the implementation of the sampling technique by means of 
the coordinated delay method, using SCFL (Source Coupled FET Logic) 
in GaAs MESFET technology. This technique presents a high resolution 
and is based on controlling delays in the clock and in the data signal paths, by 
means of delay elements. The resolution is related to the difference in the delay 
in both paths. The delay elements are implemented by means of differential 
inverters, in SCFL logic. A sample circuit of 64 stages has been designed. Its 
operation has been simulated using the HSPICE program, which resulted in a high 
resolution at a bit rate of 3.33 Gb/s. The circuit prototypes are being fabricated at 
Vitesse Semiconductor, using the H-GaAs III process with 0.6 µm gate length. 
Measurement results (such as maximum rate, resolution and jitter) will be 
presented at the meeting. 

1 INTRODUCTION 

Sample circuits are used in many applications such as signal and clock recovery 
in data transmission systems, logic analyzers, measurement systems and many 
others. Higher resolutions and sampling rates are continuous goals for new 
generations and applications. Conventional circuits present limitations to sample 
high frequency signals, where enough samples per cycle have to be taken and 
setup and hold times also need to be taken into account. The appropriate 
distribution of the clock signal through the circuit is another design issue. To 
overcome these limitations two approaches can be adopted. One is to use high 
frequency device technology and the second is to use new circuit designs. 

GaAs MESFET technology has become a mature and a viable option for high 
frequency needs. So, this technology was chosen for the design of our high bit 
rate sampling circuit. 

Figure 1 High Speed Digital Sampler: (a) circuits with delaying clock and data, 
(b) sampler operation. 

New sampling circuit techniques have been proposed to improve resolution by 
controlling the sampling time interval. This has been done through the introduction 
of a controlled delay in the data path (Kim, 1990) or in the clock path (Bazers, 
1992). It has been shown that a better resolution can be obtained by introducing 
simultaneous delays in both paths, data and clock, as illustrated in 
Figure 1 (Gray, 1994). The sampling time interval is given by the difference 
between the two delay elements, in the signal and clock paths. An additional 
advantage of the technique is that the sampling rate will be independent of the 
clock frequency. 

A better understanding of the technique is obtained through the following 
equations: 

$$Q_i = D_i(T_i) \qquad (1)$$

$$T_i = T_{i-1} + \Delta_{Ci}, \quad \text{where } i \text{ is an integer} \qquad (2)$$

$$Q_i = D_{i-1}(T_{i-1} + \Delta_{Ci} - \Delta_{Di}) \qquad (3)$$

The sampling interval is given by: 

$$\Delta_{CDi} = \Delta_{Ci} - \Delta_{Di} \qquad (4)$$

where: 

$Q_i$ - output of latch i 
$D_i(t)$ - digital wave signal at node i 
$T$ - clock period time 
$\Delta_{Di}$ - delay time of delay element i in signal path 
$\Delta_{Ci}$ - delay time of delay element i in clock path 
$T_i$ - instant of sampling at node i. 

If for every i, $\Delta_{Di} = \Delta_D$ and $\Delta_{Ci} = \Delta_C$, then the time difference between the 
delays of both elements will be constant, with $\Delta_{CD} = \Delta_C - \Delta_D$. The digital signal 
will be sampled with a theoretical resolution given by $\Delta_{CD}$. The experimental 
minimum resolution can, however, vary at each node due to the spread of device 
parameters and to variations of the ambient operating conditions. This means 
that non-linearities and non-monotonicities in the effective sampling times can occur 
(Gray, 1994). Special care has to be taken when drawing the circuit layout in 
order to minimize these effects. The resolution can be controlled and adjusted by 
adjusting the delay time of each element through an external signal. 
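
The effect of a constant delay difference can be illustrated with an idealized model. The sketch assumes identical stages, so that stage i samples the input waveform at an effective instant T0 + i·(Δc − Δd); jitter and stage mismatch are ignored. The function names and the delay values (120 ps and 100 ps) are hypothetical and chosen only for illustration.

```python
def sample_times(n_stages, t0_ps, delta_c_ps, delta_d_ps):
    """Effective sampling instants, referred to the sampler input, in picoseconds."""
    resolution = delta_c_ps - delta_d_ps  # clock delay minus data delay per stage
    return [t0_ps + i * resolution for i in range(n_stages)]


def sample_pulse(times_ps, start_ps, width_ps):
    """Latch outputs for an ideal rectangular pulse of the given start and width."""
    return [1 if start_ps <= t < start_ps + width_ps else 0 for t in times_ps]


if __name__ == "__main__":
    times = sample_times(64, 0, 120, 100)   # 20 ps resolution over 64 stages
    bits = sample_pulse(times, 100, 300)    # 300 ps wide pulse starting at 100 ps
    print("".join(str(bit) for bit in bits))
    print("samples inside the pulse:", sum(bits))
```

With these illustrative delays the clock itself is delayed by 120 ps per stage, while the waveform is observed with a 20 ps step, which is the sense in which the sampling rate is independent of the clock frequency.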

The given sampling circuit presents a maximum frequency limit due to the fact 
that the delay elements and the latches present a low pass filter behavior. This 
means that a narrow input signal cannot propagate through the sequence of 
stages of the circuit (Gray, 1994). The chosen GaAs MESFET technology and 
SCFL logic aim to extend the maximum signal frequency, compared to the 
previous CMOS technology circuit (Gray, 1994). This will allow the sampling of 
very narrow signal pulses. A second innovation is the use of differential inverters 
in both the delay stages and the latches, which were not used in the 
CMOS case (Gray, 1994). 

In this paper, a study on the implementation of the high frequency and high 
resolution sampling circuit is presented. A 64 stage circuit is designed with the 
H-GaAs III P 0.6 µm technology of Vitesse Semiconductor (Vitesse, 1993), accessed 
through the France CMP multi-project program. The HSPICE program is used to 
simulate the circuit (HSPICE, 1995). At the moment the prototype circuits are 
being fabricated. Measurement results will be presented at the conference. 



2 DESCRIPTION AND DESIGN OF THE SAMPLING CIRCUIT 



The sampling circuit is composed of three basic blocks, a delay unit for the 
clock signal, a delay unit for the data signal and the sampling latch unit. Figure 2 
shows a block diagram of one of the 64 identical stages of the circuit. 

Figure 2 Block diagram of one stage of the sampler circuit. 

A first issue in the design is to choose the type of basic inverters. DCFL 
(Direct-Coupled FET Logic) is the simplest logic gate for MESFET circuits 
(Long, 1990). Two other alternative gates suggested by the foundry are the SCFL 
(Source-Coupled FET Logic) and the BDCFL (Buffered Direct Coupled FET 
Logic). The DCFL and BDCFL gates offer the advantages of lower power 
consumption and a smaller number of transistors. However, they suffer from high 
sensitivity to threshold variations and from low fan-out capability (De Marco, 
1996). The proposed sampling circuit needs a good control of delays, which 
consequently must not vary with process and ambient variations. This imposes the 
use of gates that are less sensitive to these variations, as is the case of the SCFL 
gates due to their differential nature (Vu, 1988; De Marco, 1996). These differential 
gates are also less susceptible to noise, which will result in smaller jitter. This in 
turn is of great importance for obtaining a small (good) resolution of the sampler. 
Additionally, the SCFL gate offers a high fan-out capability, leading to smaller 
delays associated with interconnect line capacitances. These advantages of the SCFL 
gates are however paid for by their larger number of transistors, larger total area 
and larger power consumption. 

Schematics of the SCFL differential gates used as delay elements for the data 
signal and for the clock signal are shown in Figure 3 and Figure 4 respectively. 
One difference between them is the number of level shifter diodes at the output 
stage of the gates. As will be discussed below, the sample latch requires lower 
voltage levels for the clock signal than for the data signal. The resulting output 
swing of the gates is respectively 2.8 V to 3.5 V and 1.8 V to 2.5 V, for a supply 
bias of 5 V and with the gate bias terminals of the load transistors and current 
sources connected to their respective transistor sources. Figure 5 shows the transfer 
characteristic of the SCFL gate of Figure 3 (delay element of data signal). An 
average of 5 mW of power is consumed by each gate. The device dimensions were 
designed in order to: a) assure that all transistors operate in the saturation region, 
to keep gate-drain capacitances low, and b) assure a low enough gate-source bias on 
the transistors, to limit their forward gate current. The output stage is included in 
each gate in order to adjust the appropriate output levels and to increase the fan-out 
driving capability. This last characteristic is adjusted by the width of the transistors 
of this output stage. 




Figure 3 Schematic diagram of SCFL gate used as delay element for data signal. 
Channel dimensions, width / length, are indicated together to each transistor. 







Part Five High Speed Circuit Techniques 



The delay time of each gate can be adjusted by means of the biases applied to 
the gates of the load transistors (Vq) and of the current sources (Vp). The 
simulated results of delay times versus Vq and Vp are shown in Figure 6 for both 
gates of Figure 3 and Figure 4. These results show that the Vp bias has a 
higher influence than Vq on the delay time. This means that the delay depends 
mainly on the bias current through the gate and less on the value of the load 
impedance. The results of Figure 6 also show that the delay of the clock signal 
inverters is larger than that of the data signal inverters. The larger delay of 
the clock signal gates is required in order to improve the resolution (Gray, 1994). 
Actually, the resolution (difference between both delays) can be adjusted by 
appropriately choosing the bias values of Vq and Vp. This allows one to adjust it in 
accordance with the width of the data signal. 




Figure 4 Schematic diagram of SCFL gate used as delay element for clock 
signal. Channel dimensions, width / length, are indicated together to each 
transistor. 




Figure 5 Transfer characteristic of the SCFL gate of Figure 3. 

Figure 6 Simulated delay times versus Vq and Vp biases, for the gates of (a) Figure 
3 (data delay element) and (b) Figure 4 (clock delay element). 

A D type SCFL flip-flop was used for the sampling latch. The schematic of the 
flip-flop is shown in Figure 7. This flip-flop is known as HLO-FF (High-Speed 
Latching Operation Flip-Flop) and presents approximately a 30% improvement in 
speed compared to conventional latches (Murata, 1995). Its basic difference 
compared to conventional latches is that the read and latch stages have 
different current paths. This allows an independent design for each of these 
stages. The use of this differential type of latch also avoids any possible 
metastability problem as described for the latch used in CMOS technology (Gray, 
1994). Additionally, less noise or jitter is expected in the circuit since both clock 
and data signals come from differential delay lines. 

Figure 7 Schematic of the SCFL D type sampling flip-flop latch, known 
as HLO-FF (master latch and slave latch). 



3 CIRCUIT LAYOUT AND RESULTS 

One important consideration to implement the layout is to take care to balance 
the capacitances at the output nodes of the differential gates and to make them 
equal for all the delay elements. This is needed in order to produce equal delay 
intervals for all the elements to reduce non linearity effects. Figure 8 shows the 
layout of one complete stage of the circuit, including the sampling D flip-flop and 
the data and clock signal delay elements. Three interconnect metallization levels 
are employed for routing and for power supply. Decoupling capacitors are added 
to the bias power lines. Additional ground lines are added in order to shield bias 
lines and reduce induced noise to the circuit. The layout of the complete stage is 
optimized in order to facilitate the cascading of the 64 stages of the complete 
circuit. 

Figure 9 shows the layout of the complete test chip, including the 64 stage 
sampling circuit and two ring oscillators of 17 differential inverters. One ring 
oscillator is composed of inverters as used in the data signal delay elements 
(Figure 3) and the other one of inverters as used in the clock signal delay 
elements (Figure 4). The sampling circuit includes an output shift register of 64 
bits. This avoids the need to connect output pads to all the 64 stages of the 
sampling circuit and will facilitate its testing. This shift register will be controlled 
by an external clock signal. The signal transfer from the sampling stages to the shift 
register is performed through pass transistors, also enabled by an external pulse. 

Figure 8 Layout of a single sampling stage of the circuit. 




Figure 9 Layout of the complete test chip, including the 64 stage sampling 
circuit. 

HSPICE simulations of the circuit have been performed after layout parameter 
extraction, taking into account all parasitic interconnect line capacitances. Figure 
10 shows a data string flowing through the signal path at the output of stage 
numbers 1, 32 and 64. This shows that a 300 ps wide pulse can be sampled. The 
resolution of the sample circuit can theoretically vary from 0 to 185 ps. 
Experimentally, the minimum resolution will be larger than 0 ps due to the jitter 
effect. The maximum resolution value is determined by the difference between 
the maximum delay in the clock signal and the minimum delay in the data signal. 
Table 1 summarizes the present results and compares them with the ones obtained 
previously in CMOS technology (Gray, 1994). 
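
The relation between the quoted pulse width and the bit rate can be written down in a short sketch. It assumes that the bit rate is simply the inverse of the narrowest pulse that still propagates through all stages, and that the maximum resolution is the largest clock delay minus the smallest data delay; the function names and the delay values in the example are hypothetical.

```python
def bit_rate_from_pulse_width(width_s):
    """Bit rate [b/s] for one bit per pulse of the given width [s] (assumed relation)."""
    return 1.0 / width_s


def max_resolution(max_clock_delay_s, min_data_delay_s):
    """Largest sampling interval: maximum clock delay minus minimum data delay."""
    return max_clock_delay_s - min_data_delay_s


if __name__ == "__main__":
    print(f"{bit_rate_from_pulse_width(300e-12) / 1e9:.2f} Gb/s")
    # Hypothetical delay limits, chosen only to illustrate the subtraction
    print(f"{max_resolution(400e-12, 250e-12) * 1e12:.0f} ps")
```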

An alternative design is presently being developed, aiming to reduce the chip 
area, improve the speed performance and reduce the power consumption at a low 
supply bias. This is being achieved by using the low-power enable-disable 
differential logic (Ribas, 1996) and the quasi-differential switch flip-flop (Maeda, 
1996), where the power supply can be brought below 1.0 V. 



Figure 10 Simulated sampled data signal with 300 ps width pulses, at output of 
stages number 1, 32 and 64. 

Table 1 Design and performance of the sampler circuit in MESFET and its 
comparison to a previous CMOS circuit (Gray, 1994). 

                        MESFET                         CMOS 
Transistors (sampler)   3420                           2624 
Die size                4330 µm × 2100 µm              1795 µm × 3467 µm 
Total I/O pins          28 (sampler)                   58 
Process                 H-GaAs III P 0.6 µm Vitesse    MOSIS 1.2 µm N-well 
Max. bandwidth          3.33 Gb/s                      1 Gb/s 
Resolution              ?-185 ps                       25-250 ps 
Power dissipation       2.5 W at 3.3 Gb/s              1 W at 1 Gb/s 



4 CONCLUSIONS 

The design of a 64 stage sampling circuit based on the coordinated delay 
method, using SCFL gates in GaAs MESFET technology, has been presented. 
The GaAs MESFET technology offers high frequency operation while the use of 
SCFL logic assures a robust and metastability-free sampling circuit. The sampling 
resolution can vary from a minimum, to be determined experimentally, to a 
maximum of 185 ps and the bandwidth is 3.3 Gb/s. The important measurement 
results of the sampler (resolution, maximum rate and jitter) will be presented at 
the conference. 

Abstract 

This paper presents a new general purpose accelerator based on the Kress ALU Array 
III (KrAA-III), a novel field programmable ALU array (FPAA), which efficiently 
supports high performance arithmetic computations by massive pipelining. The 
KrAA-III and its underlying concepts will be introduced as a generalization of 
systolic array structural principles, illustrated by a simple real world computing 
example. To explain this approach for embedded accelerators the underlying novel 
machine paradigm, having been published earlier, is recalled as far as needed for 
comprehensibility. 

Keywords 

data sequencing, FPAA, FPL-based accelerator, generic address generation, 
image recognition, parallel memory technology 



1 INTRODUCTION 

The implementation of the idea of real world computing [RW97] requires 
supercomputing power within hand-held equipment. To achieve this, highly parallel 
high performance hardware at low cost is required. By general purpose coarse 
granularity parallel reconfigurable accelerators such achievements become feasible, 
based on a novel kind of field-programmable devices, which provide high throughput 
by instruction level parallelism [HB96]. The accumulation of very high design and 
re-design costs because of short product life times, as e.g. in the consumer 
electronics market, creates a desire for substantially reduced design cost. The new 
platform introduced by this paper provides a method of “software-only” 
(configware-only) implementation of add-on accelerators. 

FPGAs currently available commercially, and mainly used to implement control and 
glue logic, use only single bit data paths. In contrast to this fine grain parallelism 
we face different requirements for implementing computing elements. Although some 
FPGA manufacturers developed more feasible devices (e.g. Xilinx XC6200, Altera 
FLEX10K, Atmel AT6000) they are far from meeting such requirements. A wider 
datapath (e.g. 16 or 32 bits) is needed for arithmetic, relational, or other functions. 
Reconfiguration should be drastically faster and should also be possible during 
runtime. Furthermore context switches have to be possible to change efficiently 
between several tasks during runtime. In particular this is essential to provide the 
efficiency and flexibility needed for real world computing. Because of these 
requirements such a computing element goes more in the direction of processing 
elements (PEs) as used in systolic array implementations, in contrast to FPGAs, 
which are similar to the ASIC approach. Therefore we advocate the new class of 
Field Programmable ALU Arrays (FPAAs) [HB96], which are loosely related to 
systolic arrays by being a highly area-efficient mesh-connected array through wiring 
by abutment and extensive use of multiple pipelining (fig. 3). Examples of FPAAs are 
the Programmable Arithmetic Devices For Digital Signal Processing (PADDI, [YR93]) 
or the reconfigurable Datapath Architecture (rDPA, [Kr96]). 

This paper presents a novel kind of FPAAs called Kress ALU Array III (KrAA-III, see 
also [HB97a]). Whereas KrAA-I has been unidirectional, KrAA-III is bidirectional and 
its rALUs have two ports at each side (fig. 8). Further a new universal accelerator hard- 
ware called Map-oriented Machine with Parallel Data Access (MoM-PDA) integrating 
such Kress FPAAs is introduced. This machine enables parallel memory access without 
any consistency problems and supports also interleaved memory access with burst op- 
tions. The proposed accelerator works according to the Xputer paradigm [AH94] and al- 
lows flexible use of different reconfigurable ALUs (rALUs). Because we use the KrAA- 
III as a rALU array, we also call this prototype Kress Machine. After the introduction of 
the hardware, a real world computing example from automotive control demonstrates 
the use of Xputer-based accelerators applying FPAAs. 



2 THE ARCHITECTURE OF THE NOVEL KRESS ALU ARRAY 

To support highly computing-intensive applications, structurally programmable 
platforms are needed that provide word level parallelism instead of the bit level 
parallelism of FPGAs. It has been shown that on tasks with large data elements, 
fine-grained devices spend more area on interconnect than coarser-grained devices. 
(For an extensive study on area utilization refer to [De96].) For area efficiency this 
new platform is suitable for full custom design, as known from ASAPs (application 
specific array processors) operating in SIMD mode. But ASAPs are not structurally 
programmable and support only problems with completely regular data dependencies 
(fig. 4). 

Therefore FPAAs are as dense as ASAPs but each 'PE' is programmed individually 
(fig. 4). Each array element of the KrAA-III consists, like the rDPA, of a 
reconfigurable datapath unit (rDPU). (Kress has called his approach reconfigurable 
datapath array, rDPA, [Kr96].) In contrast to the original approach, the local 
interconnects between the rDPUs are also individually programmable (figure 1). 
Furthermore the global interconnect between all rDPUs is constructed hierarchically 
and enables parallel data transfers. To speed up task switches a context switch 
mechanism has been developed that allows reconfiguration of unused layers during 
runtime. 

Figure 1 KrAA-III local routability. 

Other features have been adopted from the original rDPA [Kr96]. These are: 

• transparent scalability 

• dynamic partial reconfiguration 

• 32-bit datapath width 

• the rDPU can serve as routing and/or 
arithmetic operator 

• pipe-like asynchronous inter-rDPU 
communication 

• smart interface for data scheduling (data 
streams entering and leaving the FPAA) 

The Novel Hierarchical Global Routing Network. The Kress Array provides three 
levels of routing: routing inside an rDPU (figure 1 a and b, and figure 6), nearest 
neighbour routing (figure 1 c and d) and a global bus (fig. 1 e) running through an 
entire KrAA (fig. 2). To support parallel data input and/or output, the “switch” and the 
dual inputs Bus A and Bus B have been introduced with the KrAA-III (fig. 2). The different 
levels of hierarchy are connected by reconfigurable switches. These switches make it 
possible to isolate lower levels and enable parallel inter-rDPU communication between 
rDPUs which are not located side by side. Furthermore the switches make it possible to 
connect lower levels that have only one bus to one of two parallel buses on the upper 
levels. Parallel data access from outside the array is thus made possible without wasting 
chip area on parallel routing resources on the lower levels. Figure 2 shows the KrAA-III 
chip with its internal global routing resources. The switch on the KrAA-III chip can 
connect each of the buses to each other. Outside the KrAA-III chip more hierarchy levels 
and more parallel buses are possible. As shown in figure 2, a maximum of three independent 
data transfers are possible inside one KrAA-III chip without additional routing resources 
in comparison to the rDPA. Furthermore two parallel data transfers on the global bus from 
outside can be performed. 

The Improved Local Routing Scheme. In addition to the hierarchical global routing 
network the KrAA-III has nearest neighbour local routing capabilities. The rDPA has only 
one unidirectional connection between two rDPUs and can therefore route data in only one 
direction; for the other direction data has to be routed over the global bus. The KrAA-III, 
in contrast, has full-duplex connections. With its bidirectional data transfers the 
KrAA-III saves a lot of global bus cycles. Because all data paths are 32 bits wide, the 
local connections across chip boundaries are serialized. This reduces the number of chip 
pins dramatically. 

Figure 2 KrAA-III global routability. 

The New Structure of the Reconfigurable Datapath Unit. The original DPU designed 
by Kress [Kr96] has been developed further into the rDPU, which offers higher 
functionality. All arithmetic operations of the C language can be executed on the rDPU. 
This is supported by a small register file. Additionally, routing operations can be 
performed in parallel. The configuration memory consists of four independent banks. Each 
bank holds a complete configuration for the rDPU. If the task manager (typically a 
runtime system running on the host computer) decides to execute another task, a context 
switch is performed: all rDPUs change their configuration memory bank. Therefore the 
register file for data in the rDPU also has to be implemented four times. Otherwise all 
data would have to be saved before a context switch is performed, which would cause 
excessive execution times, or context switches would only be possible once a task had 
finished. Obviously a stopped and switched-out task has to be finished later, depending 
on its priority. To gain a further speed-up from the independent configuration banks, the 
configuration control and the channels for configuration data are independent of each 
other. Thus the three idle banks can be reconfigured in parallel with the calculations on 
the active bank. This is very important for use as a general purpose accelerator, because 
calculations and configurations can be performed at the same time, so the configuration 
time is no longer a penalty for the accelerator. If configurations and calculations had to 
be performed sequentially, the configuration time would have to be added to the runtime, 
and some applications would therefore not benefit from such an accelerator. A related 
hardware/software co-design framework (e.g. CoDe-X, [HB97a]) has to take into account 
only the time of the first configuration and those reconfiguration times that are longer 
than the execution times of the predecessor tasks. Furthermore tasks executed several 
times (e.g. nested loops or recurrent functions) can stay in the configuration banks 
permanently. 
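
The effect of overlapping reconfiguration with computation can be illustrated with a small timing model. The following Python sketch is not taken from the KrAA-III or CoDe-X; the function names and the assumption that the configuration of a task starts exactly when its predecessor starts executing are hypothetical simplifications. It only counts the first configuration and the part of each later configuration that exceeds the execution time of the predecessor task.

```python
def visible_config_time(tasks):
    """tasks: list of (config_time, exec_time) tuples in execution order.

    Returns the configuration time that is NOT hidden behind computation,
    under the simplifying assumption that the configuration of task i+1
    is loaded into an idle bank while task i executes.
    """
    if not tasks:
        return 0
    penalty = tasks[0][0]  # the first configuration cannot be overlapped
    for (_, prev_exec), (cfg, _) in zip(tasks, tasks[1:]):
        # only the part exceeding the predecessor's execution time is visible
        penalty += max(0, cfg - prev_exec)
    return penalty


def total_runtime(tasks, overlapped=True):
    """Total runtime with overlapped or strictly sequential configuration."""
    exec_sum = sum(exec_time for _, exec_time in tasks)
    if overlapped:
        return exec_sum + visible_config_time(tasks)
    return exec_sum + sum(cfg for cfg, _ in tasks)


# Hypothetical task list: (configuration time, execution time)
example_tasks = [(5, 10), (8, 3), (12, 4)]
overlapped = total_runtime(example_tasks, overlapped=True)
sequential = total_runtime(example_tasks, overlapped=False)
```

In this model the second task's configuration is completely hidden, whereas the third one is only partly hidden because its predecessor runs for a shorter time than the configuration takes.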

Figure 6 illustrates the routing programmability of the proposed KrAA-III architecture, 
which provides non-multiplexed bidirectional interconnect resources between nearest 
neighbour PEs. The former rDPA provides only a single unidirectional 32-bit channel 
between nearest neighbour PEs. 

The solution illustrated by figure 6 makes it possible to map highly irregular applications 
onto a regularly structured hardware platform. Figure 5 shows a mapping example based on 
the rDPA approach only. Eight equations expressed in the C language (figure 5 a) are mapped 
onto the FPAA by the Datapath Synthesis System (DPSS, [Kr96], see figure 5 b). The 
result of this structural programming effort is the configured data path within the FPAA 
(figure 5 c: only internal data paths shown). The DPSS is a simulated annealing optimizer, 
which carries out placement and routing [Kr96]. This example also illustrates 
part of the role of the CoDe-X application compiler [HB97a]. 

Figure 4 Kress Array (ID) - hyper generalization of the systolic array: it combines the 
area-efficiency and very high throughput of the systolic array with the universality of 
the so-called “von Neumann” machine. 

Figure 5 Application Example of an automatic and optimized KrAA Mapping. 

A Compiler for Structural Software. As with an ASAP, multiple data streams enter 
and leave the FPAA. To organize these data streams, Data Scheduling [Ha95] is used. 
Resource scheduling within a configuration is no longer needed here, since this has 
already been done by the placement step of the DPSS. A particular data scheduler has been 
implemented [Ha95], which is supported by the MoM-PDA (Map-oriented Machine with 
Parallel Data Access) Xputer architecture platform and its smart memory interface. 

A better compiler front end is also required. If a hardware expert is needed to configure 
accelerators (as is still necessary for FPGA-based accelerators), the problem description 
is not really structural software. Structural software worthy of the name requires a 
source notation like the C language and a compiler which automatically generates 
structural code from it. For such a new class of hardware platforms a completely new 
class of compilers is needed, which generates both sequential and structural code: 
partitioning compilers, which separate a source into cooperating structural code segments 
(for the accelerator) and sequential code segments (for the host). 

In such an environment parallelizing compilers require two levels of partitioning: 
host/accelerator partitioning for optimizing performance (first level) and 
structural/sequential partitioning (second level) for optimizing the hardware/software 
trade-off of the Xputer resources. Furthermore the application development environment 
CoDe-X [HB97a] combines three programming paradigms into one more powerful approach: 
the control-procedural paradigm reflected in C language features, the data-procedural 
paradigm realized in an optional language extension for specifying selected 
data-procedural application parts that are executed faster by Xputer hardware, and the 
structural programming paradigm for the reconfigurable Xputer hardware components (FPAA). 
Section 5 gives a computation-intensive real world computing example. First, however, we 
introduce the target hardware paradigm utilizing the FPAA. Since even a single KrAA-III 
chip supports parallel data streams, data sequencing hardware is required which supports 
efficient parallel memory access. 

Figure 6 The rDPU-III architecture. 



3 STRUCTURED DATA SEQUENCING 

In this section the data sequencing paradigm is 
introduced. Also the principles of the necessary 
hardware structures are described. Since many 
applications provide inherently structured data, 
whereas program code storage schemes are usu- 
ally unstructured, we prefer to explore data se- 
quencing applications. Since currently it is 
unrealistic to believe, that general purpose 
processing will switch to a new machine para- 
digm, we used the area of custom computing 
machines to try out experimental applications of 
the new sequencing methodology. Machines 
based on this new paradigm are also called Xputers [HB96]. 

The main difference between the data se- 
quencing machine paradigm and von Neu- 
mann machines is, that the computer is 
controlled by a data stream instead of an 
instruction stream (but it is not a data flow 
machine [HB96]). The program to be exe- 
cuted is determined by a configuration of 
the hardware. As there are no further in- 
structions at run time, only a data memory 
is required. This data memory is organized 2-dimensionally. At run time an address 
stream is generated by a data sequencer. The accessed data is passed from the data mem- 
ory through a smart interface to the reconfigurable ALU (rALU) and back. The smart in- 
terface optimizes and reduces memory accesses. It stores interim results and caches data 
needed several times. Figure 7 shows all necessary components and their interconnect. 
These principles are derived from the fact that many computation-intensive applications 
iterate the same operations over a large amount of data. Xputers accelerate them by 
reducing the addressing overhead. All data needed for one computation step is held in the 
smart interface and can be accessed in parallel by the rALU. The rALU is implemented 
with the KrAA-III [Ha95]. Computations are performed by applying a configured complex 
operator, called compound operator, to the data. 

A further big advantage of this hardware structure is that, if the smart interface is 
integrated into the rALU, the rALU can be exchanged easily without modifying the whole 
machine. Therefore other FPAA- or FPGA-based rALUs can be used. The residual control 
between data sequencer and rALU is only needed when the data stream has to be influenced 
by the results of previous computations. Even the whole software environment, except 
the configuration code generation for the rALU, stays the same when the rALU is 
changed [HB97a]. 

Figure 8 General data stream associated: a) unidirectional, b) bidirectional. 

To clarify how operations are executed the ex- 
ecution model for Xputers is pictured in 
figure 9. A large amount of input data is typi- 
cally organized in arrays (e.g. matrix, pictures) 
where the array elements are referenced as op- 
erands of a computation in a current iteration 
of a loop. These arrays can be mapped onto a 
2-dimensional organized memory, which is 
called data map. The part of the data memory 
which holds the data for the current iteration is 
determined by a so called scan window, which 
is a model for the smart interface holding a copy of the relevant data of the memory. 
Each position of the scan window is marked as read, write or read and write. The loca- 
tion of the scan window is determined by a designated point, called handle (see figure 9). 
Operations are performed by moving the scan window over the data map and applying 
the compound operator on the data in each step. Thus this movement of the scan window 
called scan pattern is the main control mechanism of an Xputer. Because of their regu- 
larity, scan patterns can be described by only a few parameters. 
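
The execution model can be mimicked in software to make the roles of data map, scan window, handle and scan pattern concrete. The following Python sketch is only an illustration under assumed names (`video_scan`, `run_scan` and the dictionary-based window description are hypothetical and not part of the Xputer tool chain). The scan window is given as offsets relative to the handle, each marked as read (`'r'`), write (`'w'`) or both (`'rw'`), and the compound operator is an ordinary function.

```python
def video_scan(width, height, win_w, win_h, step_x=1, step_y=1):
    """Yield handle positions (x, y) of a simple row-by-row scan pattern."""
    for y in range(0, height - win_h + 1, step_y):
        for x in range(0, width - win_w + 1, step_x):
            yield x, y


def run_scan(data_map, window, pattern, compound_operator):
    """Move the scan window over the data map and apply the operator.

    data_map: list of rows (2-dimensional memory model)
    window:   dict {(dx, dy): 'r' | 'w' | 'rw'} relative to the handle
    pattern:  iterable of handle positions (x, y)
    compound_operator: maps {(dx, dy): value} of the readable positions
                       to {(dx, dy): value} for writable positions
    """
    for hx, hy in pattern:
        reads = {off: data_map[hy + off[1]][hx + off[0]]
                 for off, mode in window.items() if 'r' in mode}
        for off, value in compound_operator(reads).items():
            if 'w' not in window[off]:
                raise ValueError("position %r is not marked as writable" % (off,))
            data_map[hy + off[1]][hx + off[0]] = value


# Example: add two horizontal neighbours of row 0 and write the sum to row 1.
data_map = [[1, 2, 3, 4],
            [0, 0, 0, 0]]
window = {(0, 0): 'r', (1, 0): 'r', (0, 1): 'w'}
run_scan(data_map, window, video_scan(4, 2, 2, 2),
         lambda r: {(0, 1): r[(0, 0)] + r[(1, 0)]})
```

Note that the whole loop is described by a handful of parameters (map size, window size, step widths), which is the property the data sequencer hardware exploits.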

In fact the execution model realizes a 
2-level data sequencing. In the first 
level with the position of the scan 
window all data for one iteration is in- 
dicated. On hardware level the posi- 
tion of the scan window is determined 
by the x- and y-addresses of the scan 
pattern. In the second level the data is 
sequenced from the scan window into 
the rALU and back. This is done for 
each position depending on the read/write labels. On hardware level the data sequencer 
computes physical memory addresses for each scan window position. On this level the 
data scheduling described in section 2 is performed. 

4 XPUTER DATA SEQUENCER HARDWARE 

The data sequencer is specialized hardware for generic address generation with a small 
parameter set. This section explains the hardware realization and the underlying 
concepts. First, the physical memory organization for the 2-dimensional model is shown. 

The Memory Organization. Since we have a 2-dimensional memory accessible through 
a 2-dimensional scan window the memory can easily be cut in slices assigning the rows 
to different memory banks (figure 10). With this organization, access to different rows 
in parallel is made possible. This is done at the second level of the 2-level data sequenc- 
ing. Depending on the handle position the hardware has to determine which row inside 
the scan window is assigned to which memory bank. There are n parallel memory banks 
possible but to simplify the explanation only 2 parallel memories are illustrated. Each of 
the parallel memory banks is physically assigned to one of the parallel KrAA-III buses 
(figure 2). 

The data map for a task can be a rectangle of any size. Several tasks with different data 
maps may be mapped onto the physical memory at the same time. Special hardware 
is required to avoid big areas of unused memory. Because the memory is organized 
2-dimensionally, it has to be mapped by the data sequencer onto commercially available 
1-dimensional memories. 

Figure 10 Memory distribution. 

The Data Sequencer Hardware. The data sequencer hardware can be divided into a central 
control unit and an address generation data path. This data path is a pipelined structure 
with two stages: the Handle Position Generator (HPG) and the Scan Window Generator 
(SWG). Each stage of the pipeline performs one level of the 2-level data sequencing. The 
third part of the data path is a Memory Map (MemM) function followed by the Burst 
Control Unit (BCU), which generates the memory control lines. The complete structure is 
pictured in figure 11. 

As illustrated in figure 11, the number of memory banks affects only the second pipeline 
stage. The handle position is the same for all parallel memories. Though the Memory Map 
function and the Burst Control Unit are instantiated for every memory bank, their 
structures are not influenced. To keep this structure scalable, an FPGA implementation is 
considered. 

Figure 11 The Data Sequencer Datapath. 

As the KrAA-III structure provides context switching for multitasking applications and 
for fast task switching through simultaneous configuration, the data sequencer has to 
extend this functionality. This is because large FPAAs can be configured with several 
small tasks and process them in parallel. For this purpose the data sequencer datapath 
provides a deep stack to store the parameter sets of all active tasks running in parallel 
as well as of all tasks switched out on the idle context layers of the KrAA-III. The size 
of the stack benefits from the small parameter sets required for generic address 
generation. The tasks executed on the active layer of the KrAA-III are controlled by the 
central control unit of the data sequencer. Context switches on the KrAA-III and the 
required change of the parameter sets inside the data sequencer datapath are initiated by 
the runtime system running on the host computer. 

The Handle Position Generator. The HPG performs the first level of the 2-level data 
sequencing. It provides two identical stepper units for the x- and y-address generation 
[HB97b]. In addition there is a context switcher unit (figure 11) which adds an offset 
address for each scan pattern. This makes it possible to hold several data maps in the 
physical memory at the same time. The entire stepper hardware is designed to store several 
parameter sets for nested or meshed scans and for independent scans running in parallel. 
Therefore the scan parameters have to be exchanged very fast. Most of the parameters are 
constants which only have to be read. Only the current position of the slider and an 
offset address for relative scans have to be stored as interim results. Therefore a 
parameter stack is required for each address generator. 

The Scan Window Generator. The position of the scan window is determined by the handle 
position generated by the HPG. The SWG (figure 12) has to access all memory locations 
inside the scan window (see figure 9 and figure 10). This is realized by adding offsets to 
the handle position. Further the flags for read and write operations have to be set. If 
the data memory supports burst read or write operations, these are initiated. Since 
several physical memories are accessed in parallel, all these actions have to be done in 
parallel. All parameters for the second level of the 2-level data sequencing are provided 
by offset generators (figure 12). They are stored in look-up tables and sequenced in 
succession by a finite state machine. The look-up tables provide capacity for 16 different 
scan windows at the same time. The switching between different tasks is initiated by the 
central control unit. Further the SWG has to determine which adder unit is assigned to 
which memory bank, depending on the handle position. Since the 2-dimensional memory is 
cut into rows, this can easily be done by evaluating the LSB of the y-address. The SWG 
pictured in figure 12 supports 2 parallel memories and has to be extended for more 
parallel banks. 
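
The second sequencing level can be sketched in a few lines of Python. This is a hypothetical software model, not the SWG implementation: the function name and the tuple layout of the offset table are assumptions. It adds the stored offsets to the handle position, carries the read/write flag along, and selects one of two memory banks by the least significant bit of the y-address, as described for the two-bank case.

```python
def scan_window_accesses(handle, offsets):
    """Model of the second sequencing level for two parallel memory banks.

    handle:  (x, y) handle position delivered by the first level
    offsets: list of (dx, dy, flag) entries, flag being 'r', 'w' or 'rw'
             (stands in for the look-up table of one scan window)
    Returns {bank: [(x, y, flag), ...]}; accesses listed under different
    banks could be issued in parallel.
    """
    hx, hy = handle
    per_bank = {0: [], 1: []}
    for dx, dy, flag in offsets:
        x, y = hx + dx, hy + dy
        per_bank[y & 1].append((x, y, flag))  # LSB of y selects the bank
    return per_bank


# Hypothetical scan window: two reads in the handle row, one write below it.
window_table = [(0, 0, 'r'), (1, 0, 'r'), (0, 1, 'w')]
accesses = scan_window_accesses((10, 3), window_table)
```

Moving the handle by one row swaps the roles of the two banks, which is why the assignment has to be re-evaluated for every handle position.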

The Memory Mapper. All memory banks are commercially available 1-dimensional 
memories. In cases where the number of rows in the memory map exceeds the number 
of parallel memory banks, several rows have to be mapped onto one memory. If only 
one memory bank exists, all rows are mapped onto this bank. In that case only a 
2-dimensional visualization of the traditional 1-dimensional memory is performed. To have 
several rows of the memory map mapped onto the same physical memory device, the x- 
and y-addresses have to be merged into one physical memory address. To eliminate 
unused memory areas in the physical memory, the output addresses are shifted according 
to the exact size of the actual data map, since not every application requires the 
complete address range of 16 bits. Often some leading bits of the x- and y-addresses are 
unused. This results in different data map sizes and shapes. Simply merging the 16-bit 
x- and y-addresses into a 32-bit physical memory address would therefore waste memory, 
as the two addresses hardly ever exploit all 16 bits. The leading zeros of the x-address 
are the reason for this waste, as can be seen in figure 13a. Therefore an additional shift 
operation is implemented in the Memory Mapper unit. The exact number of shifts 
depends on the application. It is known and fixed at compile time. If there are several 
tasks in the physical memory, the context switcher in the HPG ensures that there is no 
memory violation. Because the memories have to be accessed in parallel, this hardware is 
required for each memory bank. 
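
The benefit of the additional shift can be seen from a small address calculation. The following Python sketch is an assumption-laden illustration, not the Memory Mapper design: the function names, the interleaved assignment of rows to banks (`y // n_banks` as the row index inside a bank) and the `base` offset standing in for the context switcher are all hypothetical. It contrasts a naive concatenation of two 16-bit addresses with a mapping that shifts the y-part only by the number of x-bits the data map really needs.

```python
def bits_needed(size):
    """Number of address bits required to index 'size' positions."""
    return max(1, (size - 1).bit_length())


def naive_address(x, y):
    """Concatenate full 16-bit x- and y-addresses into a 32-bit address."""
    return (y << 16) | x


def mapped_address(x, y, map_width, n_banks=1, base=0):
    """Compact physical address inside one memory bank.

    map_width: width of the data map (known at compile time)
    n_banks:   number of parallel banks; rows are assumed to be
               interleaved, so row y is row y // n_banks inside its bank
    base:      offset of this task's data map in the physical memory
    """
    x_bits = bits_needed(map_width)  # shift amount fixed at compile time
    row_in_bank = y // n_banks
    return base + ((row_in_bank << x_bits) | x)


# A data map that is 100 positions wide needs only 7 x-address bits.
compact = mapped_address(5, 3, map_width=100)
wasteful = naive_address(5, 3)
```

With the naive scheme, consecutive rows are 65536 addresses apart regardless of the map width; with the shifted scheme they are 128 apart for this example.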

The Burst Control Unit. Since interleaved memory with burst options is supported, an 
additional unit has to control the burst operations. The required signals for variable 
burst lengths are generated by the Burst Control Unit. Because the memories have to be 
accessed in parallel, this hardware is required for each memory bank. 



5 A REAL WORLD COMPUTING APPLICATION: AUTOMOTIVE COLLISION DETECTION 

An embedded accelerator example for real-time automotive collision detection has 
been described in [Ha96]. Because of the new parallel memory access and context 
switch features of the data sequencer and the KrAA-III, this application gains a further 
speed-up. Therefore the improved version of this approach is introduced. 

The availability of high-dynamic-range CMOS image sensors (HDRC) with random access 
to the pixel array serves as a prerequisite for automotive applications like automatic 
collision prevention. A sensor with a high dynamic range is necessary to avoid the 
saturation known from CCD cameras, for example when a car leaves a tunnel and the 
sensors have to be able to detect an approaching car against the bright sunlight. The 
real-time analysis can be performed more efficiently if random access to the sensor array 
makes it possible to concentrate on the interesting portions of the image, which greatly 
reduces the data transfer rates required between the sensor and the image processing 
hardware. The design of hardware to perform such collision detection can be seen in 
figure 14. 

Figure 14 Automotive collision detection using Xputer accelerator. 

Since the HDRC can be accessed like a regular memory device, its address lines are 
directly connected to the data sequencer. The data output is connected to one of the 
KrAA-III buses. Though the KrAA-III buses are bidirectional, they are pictured as directed 
in figure 14 to illustrate the flow of data in this example. The second KrAA-III bus is 
connected to the data memory. Because the HDRC provides only one data bus, parallel 
memory access is only possible for parallel HDRC read and data memory write operations. 
The corresponding memory organization is illustrated in figure 15. It corresponds 
to the memory organization in section 4, where the second memory bank is replaced by 
the HDRC. 

Figure 15 Memory organization for automotive collision detection example. 

The way operations are performed is as follows. The Xputer data sequencer computes the 
address sequences for the HDRC and the data memory. On the sequenced data the KrAA-III 
performs an edge detection by applying a two-dimensional FIR filter with appropriate 
weights. The filter equation is shown in figure 16. The kij are the coefficients of the 
filter and the wij contain the data at the corresponding scan window positions (grey 
fields in figure 15). The result wii is written back into the scan window (white field). 
This scan window position is placed in another row assigned to the data memory. Figure 17 
shows the equation mapped into the KrAA-III. The input registers of the rDPUs named wij 
and the output register named w^ represent the scan window positions. In figure 17 the 
input registers named kij represent the registers with coefficients of the FIR filter 
loaded at configuration. For more details see [Ha96]. 
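
Since the filter equation itself is only given in figure 16, the following Python sketch shows a generic two-dimensional FIR filter as a plain weighted sum of coefficients and scan window contents. The function names, the 3x3 window size and the Laplacian-type example weights are assumptions chosen for illustration; the weights actually used in [Ha96] are not stated in this text.

```python
def fir2d_step(window, coeffs):
    """Compound operator: weighted sum of the scan window contents."""
    return sum(k * w
               for k_row, w_row in zip(coeffs, window)
               for k, w in zip(k_row, w_row))


def fir2d(image, coeffs):
    """Apply the filter at every scan window position of a video scan."""
    kh, kw = len(coeffs), len(coeffs[0])
    result = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            window = [r[x:x + kw] for r in image[y:y + kh]]
            row.append(fir2d_step(window, coeffs))
        result.append(row)
    return result


# Assumed example weights (Laplacian-type edge detector, not from the paper)
edge_coeffs = [[0, -1, 0],
               [-1, 4, -1],
               [0, -1, 0]]

# Small test image with a vertical edge between columns 1 and 2
image = [[0, 0, 9, 9] for _ in range(3)]
edges = fir2d(image, edge_coeffs)
```

On a uniform image region the weighted sum is zero, while the two window positions next to the edge produce non-zero results of opposite sign.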

The pixel array of the HDRC image sensor is accessed directly by the data sequencer 
(figure 14). First the whole pixel array is scanned with a video scan to detect the borders 
of the street. Depending on their location, data-dependent scans are performed, which 
correspond to local video scans along these borders. So only the relevant parts of the 
image are read. The KrAA-III also transfers data to the data sequencer for controlling 
these data-dependent scans (residual control, figure 14 or figure 7) [AH94]. After every 
5 frames of scanning the relevant street borders, a video scan over the complete image 
is performed by the data sequencer and the edge detection algorithm is applied. 
Processing the complete image is necessary for identifying approaching cars etc. (See the 
image analysis task of the host below.) 

Because a different algorithm is required, a context switch is performed. As a result no 
reconfiguration is required and no hardware resources have to stay idle. (One of these two 
points would be the case if an FPAA without context switch capabilities were used.) The 
data memory is also accessible by the host for data exchanges with the embedded 
accelerator. After this image preprocessing the host CPU performs the image analysis 
tasks, where the edges have to be combined into objects. These are identified to detect 
obstacles, for example an approaching car or the borders of the street. Since these 
algorithms do not exhibit address patterns known at compile time, they can be handled with 
less overhead by a von Neumann processor than by a data sequencer. If the car were about 
to cross the borders of the street, for example, a controller connected to the steering 
mechanism of the car could correct its course. Likewise, if an approaching car is 
identified, this controller could react in a suitable way. 

The hardware for the image preprocessing might be an FPAA like the rDPA [Kr96], the 
KrAA-III or the reconfigurable data-driven multiprocessor architecture proposed in 
[YR93]. Generally FPAAs are superior to special purpose hardware for the image 
preprocessing with regard to flexibility in the choice of algorithms to be run. Using the 
MoM-PDA (Map-oriented Machine with Parallel Data Access) and a host makes it possible to 
perform the image analysis in a pipelined fashion, where the image acquisition, the image 
preprocessing, and the image analysis are done in parallel in different stages of the 
pipeline. This does not even increase the latency from the image acquisition to the 
collision detection, because all these steps would have to be done sequentially in a 
non-pipelined fashion as well, only at lower overall data rates or higher bandwidth 
requirements. The versatility of the scan patterns makes it possible to run a wide variety 
of image preprocessing algorithms in an endless loop without requiring any download of 
parameters after the initial power-up configuration. 




Figure 17 Two-dimensional FIR filter mapped into the KrAA-III. 



6 CONCLUSIONS 

The use of the data-driven Xputer architecture as an embedded reconfigurable accelerator 
for real world computing and many other applications has been presented. The data-driven 
mechanism has been realized by a novel kind of sequencer hardware. This data 
sequencer is superior to the address generation units found in digital signal processors 
as well as to other separately available address generation devices on the market (e.g. 
HSP45240 [Ha92]). It provides the support to run complete image and signal processing 
algorithms on structured data without requiring an update of parameters after an initial 
configuration, even though such an update is possible. Multiple tasks can be executed 
on a data sequencer. The usefulness of the novel universal accelerator approach has been 
illustrated by a real-time application example: automatically preventing a car from 
leaving the lane, and collision prevention. 

The novel KrAA-III has been explained in detail and several new mechanisms for speed-up 
(improved rDPU, context switch mechanism and hierarchical routing structure) have been 
introduced. The context switch mechanism in particular recommends the KrAA-III 
architecture because of its ability to keep several tasks resident simultaneously. 
Furthermore reconfigurations of idle configuration layers can be performed in parallel 
with calculations on the active layer. 

This general purpose accelerator approach promises high speed-up factors at low cost 
for a wide variety of computation-intensive applications in real world computing and 
other embedded systems, as well as in desktop and scientific computing [AH94]. 



Abstract 

In this paper, a novel 108-bit conditional sum adder (CSA) with Energy 
Economized Pass-transistor Logic (EEPL) is proposed. A new architecture is 
adopted, which is composed of seven modularized 16-bit CSAs and two separated 
CARRY Generation Blocks. In order to obtain high speed operation, the CARRY 
Generation Blocks are separated from the modularized CSAs. Further, a design 
technique based on EEPL is proposed to reduce the power consumption. With a 
0.65 µm single poly, triple metal 3.3 V CMOS process, its operating speed is about 
4.93 ns and the power consumption is reduced in comparison with that of the 
conventional adder. 



Keywords 

Conditional Sum Adder (CSA), Energy Economized Pass-transistor Logic (EEPL), 
Separated CARRY Generation Block 



VLSI: Integrated Systems on Silicon R. Reis & L. Claesen (Eds.) 
© IFIP 1997 Published by Chapman & Hall 







Part Six Application Specific DSP Architectures 



I. INTRODUCTION 

In recent years, wide adders have become key components which determine the system 
performance in areas such as digital signal processors (DSP) and arithmetic logic 
units (ALU). Therefore, the need to design high speed and low power adders is 
growing. 

In this paper, a 108-bit conditional sum adder (CSA) with Energy Economized 
Pass-transistor Logic (EEPL) is proposed (Song, 1996). Pass-transistor logic 
has become popular in the design of VLSI systems. Owing to its high speed 
operation and low power consumption, it is now widely used (Yano, 
1996) (Sakurai, 1995). Based on EEPL, the proposed CSA achieves high speed 
operation and low power consumption. In order to obtain high speed 
operation, a separated CARRY Generation Block is proposed and designed in this 
study. In comparison with the conventional adder (Ohkubo, 1995), it has a faster 
operating speed and lower power consumption. With a 0.65 µm single poly, triple 
metal 3.3 V CMOS process, its operating speed is about 4.95 ns and the power 
consumption is reduced in comparison with that of the conventional adder. 

In Section II, the proposed architecture of the 108-bit CSA is discussed. We 
explain how it is designed and the advantages of the architecture. Further, the 
circuit design technique based on EEPL is also discussed. In Section III, the 
separated CARRY Generation Block is described, with the main focus on its 
characteristics. In Section IV, simulation results and other characteristics are 
described. Finally, we summarize the results in Section V. 

II. ARCHITECTURE OF THE PROPOSED ADDER 

The proposed architecture of a Conditional Sum Adder (CSA) with a separated 
CARRY Generation Block is shown in Figure 1. It is composed of seven 
modularized 16-bit CSAs and two separated CARRY Generation Blocks. In order 
to obtain high speed operation, the CARRY Generation Blocks are separated 
from the modularized CSAs. Before the carry propagation of each modularized 
CSA is transferred to the final stage, the block carry propagation (BCi, BCiB) of 
the separated CARRY Generation Block drives the final stage. This will be discussed 
later. 

Each modularized 16-bit CSA is composed of a Pre C&S Generation Block and a 
SUM Generation Block. The Pre C&S Generation Block generates the preliminary 
CARRY and SUM signals, which are then used to obtain the final results, as 
discussed in this section. The SUM Generation Block produces the desired 
results. Using several kinds of multiplexers, it obtains the output in a short 
time. With this architecture, the total delay is reduced by about two 
MUX-equivalent delays in comparison with that of the conventional adder 
(Ohkubo, 1995). 

Figure 1 The Architecture of the 108-bit Conditional Sum Adder (CSA). It is 
composed of seven 16-bit modularized CSAs (16-b x 7 CSA) and two high speed 
Carry Generation Blocks. Before the carry propagation of each modularized CSA 
reaches the final stage, the carry propagation of the high speed Carry 
Generation Block drives the final stage. Therefore, the total delay is shorter 
by about two MUX-equivalent delays than that of the conventional one. 

Figure 2 Circuit Diagram of the Pre C&S Generation Block. CH is equivalent to 
the Propagate Signal of the two inputs and CL is equivalent to the Generate 
Signal of the two inputs. 

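Figure 2 shows the circuit diagram of the Pre C&S Generation Block shown in 
Figure 1. In this block, the signals are given by the following equations, 
where G and P are the generate and propagate signals of the two inputs and 
C_{-1} is the incoming carry: 

\[ SH = AB + \bar{A}\bar{B} \quad \text{(XNOR)} \]

\[ SL = A\bar{B} + \bar{A}B \quad \text{(XOR)} \]

\[ CL = G + P\,C_{-1} = G = A \cdot B \quad (C_{-1} = 0) \]

\[ CH = G + P\,C_{-1} = G + P = P = A + B \quad (C_{-1} = 1) \]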

Except for SH and SL, the signals are generated by EEPL. SH and SL are sent to 
the inputs of the SUM Generation Block, where their levels are restored. This 
is because it is efficient to place level restoration circuits only at the 
optimal positions (Song, 1996). Furthermore, the metal routing would become 
complicated if there were too many level restoration circuits. 
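
Figure 3 shows the block diagram of the SUM Generation Block. In the figure, 
the desired output is obtained from the principle of a simple CSA and the 
following proposed equations, where C_{j-2} = 1 for SH_j and C_{j-2} = 0 for 
SL_j: 

\[ SH_j = \bar{X}_j \cdot (G_{j-1} + P_{j-1} C_{j-2}) + X_j \cdot \overline{(G_{j-1} + P_{j-1} C_{j-2})} \]

\[ \phantom{SH_j} = \bar{X}_j \quad (\text{when } G_{j-1} + P_{j-1} = P_{j-1} = 1) \]

\[ \phantom{SH_j} = X_j \quad (\text{when } G_{j-1} + P_{j-1} = 0) \]

\[ SL_j = \bar{X}_j \cdot (G_{j-1} + P_{j-1} C_{j-2}) + X_j \cdot \overline{(G_{j-1} + P_{j-1} C_{j-2})} \]

\[ \phantom{SL_j} = \bar{X}_j \quad (\text{when } G_{j-1} = 1) \]

\[ \phantom{SL_j} = X_j \quad (\text{when } G_{j-1} = 0) \]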

Because of the characteristics of the conditional sum adder, the data from two 
stages earlier can be used at the current stage. Thus the operating speed is 
higher than that of the conventional adder. This is one of the main ideas of 
this paper. Figure 4 shows the circuit diagrams of the multiplexers used in 
Figure 3. MS denotes a multiplexer with a single input, MSL a multiplexer with 
a single input and a level restoration circuit, MD a multiplexer with double 
inputs, and MDL a multiplexer with double inputs and a level restoration 
circuit. When the output of a multiplexer drives the gate of the next 
multiplexer, a multiplexer with a level restoration circuit must be used, 
because the gate input has to be driven over the full range. However, when the 
output of a multiplexer drives only the data input of the next multiplexer, no 
level restoration circuit is needed. In this way, the most suitable multiplexer 
is chosen and designed according to the role and location of each multiplexer, 
as shown in Fig. 3. 

Fig. 5 shows the SPICE simulation results for each multiplexer. The delay of 
MDL is shorter than that of MSL, because the level restoration circuit based on 
EEPL is faster than the circuit based on the simple inverter shown in Figure 
4(b). Assuming that the CIN and CINB signals of Figure 1 are driven by the 
separated CARRY Generation Block earlier than by the carry propagation of the 
modularized CSA, the propagation delay of the 108-bit adder is equivalent to 
that of ten MDLs in cascade: one equivalent MDL in the Pre C&S Generation 
Block, eight equivalent MDLs in the 16-bit CSA, and one MSL driven by CIN in 
the final stage. This count can be read directly from Figure 3, where the 
critical path consists of eight vertical MDLs and one MSL. In the conventional 
CSA, the propagation delay of the 108-bit adder is equivalent to that of twelve 
MDLs in cascade (Ohkubo, 1995). Therefore, the propagation delay of the 
proposed adder is reduced by about two equivalent MDL delays in comparison with 
that of the conventional adder. Section III describes the design of the 
separated CARRY Generation Block. 

Figure 4 Circuit Diagram of Each Multiplexer. The multiplexers were optimized 
according to the location and role of each MUX. (a) MS (Multiplexer with Single 
input) (b) MSL (Multiplexer with Single input and Level restoration block) 
(c) MD (Multiplexer with Double input) (d) MDL (Multiplexer with Double input 
and Level restoration block) 




Figure 5 Comparison of Each Multiplexer. The delay of MDL is shorter than 
that of MSL. 

III. DESIGN OF THE SEPARATED CARRY GENERATION BLOCK 

The block diagram of the separated CARRY Generation Block is shown in Figure 6. 
It is designed to increase the operating speed. It is composed of CLA1 and 
CLA2, whose circuit diagrams are shown in Figure 7 and Figure 8, respectively. 
CLA1 is designed on the basis of the theory of group sum and group carry, and 
each CLA1 handles a 4-bit group. If CLA1 is designed with the conventional CMOS 
circuits shown in Figure 7, the propagation delay can be reduced by about two 
MUX-equivalent delays in comparison with that of the conventional one. After 
two CLA1s are connected in cascade, their output is transferred to CLA2. CLA2 
is composed of CLA1 and MDL, as shown in Figure 8. Therefore, the total 
propagation delay of the separated CARRY Generation Block is equivalent to 
three CLA1s and two MDLs, or, in other words, to eight MUX delays in series. 
The separated CARRY Generation Block is thus faster than the SUM Generation 
Block by about two equivalent MUX delays. 
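
As a rough consistency check of the delay figures quoted above, and assuming (this is not stated in the text) that one MDL counts as one MUX-equivalent delay, the eight-MUX total implies about two MUX delays per CLA1. The symbols t_CLA1, t_MDL and t_MUX are introduced here only for illustration.

```latex
\[
3\,t_{\mathrm{CLA1}} + 2\,t_{\mathrm{MDL}} = 8\,t_{\mathrm{MUX}},
\qquad t_{\mathrm{MDL}} = t_{\mathrm{MUX}}
\;\Longrightarrow\;
t_{\mathrm{CLA1}} = 2\,t_{\mathrm{MUX}}
\]
\[
t_{\mathrm{SUM}} - t_{\mathrm{CARRY}} \approx (10 - 8)\,t_{\mathrm{MUX}} = 2\,t_{\mathrm{MUX}}
\]
```

Under the same assumption, the ten-MDL critical path of the SUM Generation Block exceeds the carry path by two MUX delays, which matches the margin stated in the text.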







Figure 6 Block Diagram of the Proposed High Speed CARRY Generation Block. 

Figure 8 Block Diagram of the CLA2. It is composed of CLA1 and MDL. 



IV. RESULTS 

Figure 9 shows the SPICE simulation results of the proposed 108-bit adder. The 
carry propagation is faster than the sum propagation, and the overall delay is 
about 4.95 ns. In the figure, BC7 denotes the final carry and X<107> denotes 
the final sum result of the 108-bit adder. Because the carry signal arrives 
earlier, the critical path of the 108-bit CSA is shorter than that of the 
conventional adder. Furthermore, the power consumption of the proposed adder is 
lower than that of the conventional one. The SPICE simulations were carried out 
with model parameters for a 0.65 µm triple-metal, single-poly 3.3 V CMOS 
technology. 

Figure 9 SPICE Simulation Results of the Proposed 108-bit Adder. The carry 
propagation is faster than the sum propagation. 



V. CONCLUSIONS 

A low power 108-bit Conditional Sum Adder with Energy Economized 
Pass-transistor Logic (EEPL) was proposed. It consumed less power than the 
conventional adder because it was designed on the basis of EEPL. To obtain 
high-speed operation, an architecture with a separated high speed CARRY 
Generation Block was adopted. In a 0.65 µm single-poly, triple-metal CMOS 
process with a 3.3 V power supply, the propagation delay of the proposed adder 
was about 4.95 ns. It will be a useful technique for designing high-speed, low 
power digital signal processors and other applications. 

Abstract 

We demonstrate the design of a Globally Asynchronous Locally Synchronous 
(GALS) Discrete Fourier Transform circuit. Each locally synchronous stage is 
surrounded by an “Asynchronous Wrapper” which provides an asynchronous 
interface to an otherwise synchronous circuit. Every locally synchronous (LS) 
region operates independently, eliminating problems of clock skew and allowing 
each region to run at its own clock speed. Metastability can never occur because an 
asynchronous handshake “stretches” or “pauses” the local clock until data has 
stabilised. When new data is not available for processing the local clock stretches, 
automatically preventing the LS block from consuming power. When new data 
does arrive, the clock starts directly in phase with the handshake without wasted 
synchronisation time. 

The internal DFT stages were designed using typical synchronous techniques. 
We were therefore able to use VHDL to quickly compose and synthesise the circuit 
using industry standard tools. Most current asynchronous design methodologies 
require either manual design or complex specifications which become unwieldy as 
circuit size grows. Locally synchronous design allows us to take advantage of 
normal synchronous methods, reducing design time while providing a robust 
interface that delay-insensitively communicates with the environment. 

1 INTRODUCTION 

Globally asynchronous locally synchronous circuits combine an asynchronous 
external interface with an internally clocked circuit. Instead of a fixed period clock 
the locally synchronous blocks use a variable length stretchable clock. These 
clocks, as presented elsewhere (Pechoucek, 1976) (Seitz, 1980) (Chapiro, 1984), 
normally behave just like a normal synchronous clock. However, the clock also has 
a stretch control which can prevent it from moving out of the current clock period. 
The stretch signal must be asserted synchronously, but can be released 
asynchronously. Once the stretch signal is released, the clock continues in its 
normal manner having simply been displaced outwards in time. 

The class of circuits presented here allows a collection of locally synchronous 
regions to communicate without fear of metastability. The clocks do not stretch 
over periods of metastability like those using Q-modules (Rosenberger et al., 
1988), but instead stretch to actually prevent metastability from occurring 
(Pechoucek, 1976). Rather than, as in totally synchronous design, using double or 
triple buffering or slowing down clock speeds to increase a system's MTBF due to 
metastability, this allows a designer to run their circuits at maximum speed with an 
infinite metastability MTBF. 

While the theory behind these metastability-free circuits has been known for 
many years, a simple and coherent methodology for designing them has not been 
available. By using Extended-Burst-Mode specifications (Yun, Dill, 1993), we 
have created a small set of asynchronous building blocks. These blocks 
collectively form an asynchronous "wrapper," surrounding a locally synchronous 
circuit and making it externally appear as an asynchronous handshake circuit. 
Armed with these simple circuits, a designer is able to subdivide a globally 
synchronous circuit into locally synchronous regions, reducing problems of clock 
skew and improving modularity. Additionally, a totally asynchronous system can 
be implemented with some or all of the asynchronous modules created using 
locally synchronous circuits. 

Using this set of building blocks we are able to combine asynchronous and 
synchronous circuits simply and elegantly in a single system. The building blocks 
are small and compact and offer the possibility of communication between locally 
synchronous regions with low overheads. This extends the composability of 
asynchronous circuits into the synchronous domain. 

2 CORDIC SLIDING WINDOW DFT ALGORITHM 

The Coordinate Rotation Digital Computer (CORDIC) algorithm, first proposed by 
Volder (1959), rotates a vector through an angle θ using only shift and add 
operations. This is accomplished by splitting the rotation angle θ into a 
sequence of m subrotations of angles tan^{-1}(2^{-i}) where i=0...m-1 for m 
bits of accuracy. One clock cycle is required for each subrotation and thus the 
entire rotation takes m cycles. After the rotation, the resulting vector also 
needs to be scaled, which requires approximately m/4 extra clock cycles 
(Cavallaro, Luk, 1988). The equation below describes the basic CORDIC rotation 
operation. 

\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = R(\theta) \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{where} \quad R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \qquad (1) \]
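
The shift-and-add rotation described above can be sketched in a few lines. This is a floating-point illustration under stated assumptions, not the hardware datapath: the function name `cordic_rotate`, the default m = 16, and the single final division by the accumulated gain (standing in for the separate scaling cycles) are choices made here. In hardware the multiplications by 2^-i would be right shifts.

```python
import math


def cordic_rotate(x, y, theta, m=16):
    """Rotate (x, y) by theta radians using m CORDIC subrotations.

    Converges for |theta| up to roughly 1.74 rad (the sum of all
    subrotation angles).
    """
    for i in range(m):
        d = 1.0 if theta >= 0 else -1.0          # direction of subrotation i
        shift = 2.0 ** -i                         # a right shift in hardware
        x, y = x - d * y * shift, y + d * x * shift
        theta -= d * math.atan(shift)             # residual angle still to rotate
    # Each subrotation stretches the vector by sqrt(1 + 2^-2i); undo the
    # accumulated gain once at the end (the "scaling" step).
    gain = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(m))
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, 0.6)` returns a vector close to (cos 0.6, sin 0.6); the residual angle error after m subrotations is bounded by the last subrotation angle, about 2^-(m-1) radians.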

Kar and Rao (1996) have proposed a unified CORDIC algorithm for the 
evaluation of the sliding window DFT, DHT, DCT, and DST. We focus only on 
the DFT here for simplicity. The DFT of a window of N complex data elements 
x(i), x(i+1), ..., x(i+N-1) is given below for each k=0...N-1. Note that the 
new DFT is based on the previous window, where the new values p(i+N) and q(i+N) 
are added and the oldest values p(i) and q(i) are removed. 

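\[ \begin{bmatrix} P_{i+1}(k) \\ Q_{i+1}(k) \end{bmatrix} = R\!\left(\frac{2\pi k}{N}\right) \begin{bmatrix} P_i(k) + p(i+N) - p(i) \\ Q_i(k) + q(i+N) - q(i) \end{bmatrix} \qquad (2) \]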

Equation 2 presents the DFT in the form of the CORDIC rotation above. For 
each stage k=0...N-1, a CORDIC rotation is performed with θ = 2πk/N. The 
input vector to all stages is the same, so this vector can be passed unmodified 
from one stage to the next as the sliding window progresses. 
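
The sliding-window update can be checked numerically against a directly computed DFT. The sketch below assumes the common DFT convention with kernel exp(-j2πkn/N), which the text does not spell out; the function names `direct_dft` and `slide` are hypothetical, and exact sines and cosines stand in for the CORDIC rotation of each stage.

```python
import cmath
import math


def direct_dft(window):
    """Reference DFT of one window (assumed kernel: exp(-2j*pi*k*n/N))."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
        for k in range(n)
    ]


def slide(P, Q, new, old):
    """One sliding-window update: add the newest sample, drop the oldest,
    then rotate bin k by 2*pi*k/N. P and Q hold real and imaginary parts."""
    n = len(P)
    dp = new.real - old.real        # p(i+N) - p(i)
    dq = new.imag - old.imag        # q(i+N) - q(i)
    P_next, Q_next = [], []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        u, v = P[k] + dp, Q[k] + dq
        P_next.append(u * math.cos(theta) - v * math.sin(theta))
        Q_next.append(u * math.sin(theta) + v * math.cos(theta))
    return P_next, Q_next
```

Starting from the DFT of samples 0..N-1 and calling `slide` with sample N as `new` and sample 0 as `old` reproduces the DFT of samples 1..N to within rounding error. Every bin receives the same difference vector, which is why the input can be passed unmodified from stage to stage.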



3 ASYNCHRONOUS WRAPPER 



Our goal is to create a circuit where each locally synchronous stage of an N- 
element sliding window DFT communicates asynchronously with its environment. 
We choose to use a four-phase bundled data handshake protocol. This means that 
there are four events per handshake cycle: Req+, Ack+, Req-, and Ack-. We adopt a 
late data-valid scheme (Peeters, 1996) whereby data is guaranteed to be valid when 
a Req- transition occurs and may change at any time after Ack-. 

To provide the asynchronous interface to the environment, we surround each 
locally synchronous module with an asynchronous wrapper (see Figure 1). The 
wrapper synchronises the asynchronous data to the local clock by stretching or 
pausing the clock. When the LS module requires a data exchange, one or more 
ports are selected with the Input and Output signals. This causes the port’s Stretch 
signal to rise until the handshake occurs. While Stretch is high, the Clock+ 
transition is prevented until Stretch falls. Stretch can never shorten the clock 
period; if Stretch falls quickly, Clock+ will not occur until a normal clock period 
expires. 
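
The effect of Stretch on the clock can be summarised by a one-line timing rule. The sketch below is an idealised model with assumptions of its own: times are plain numbers in arbitrary units, the delay from Stretch- to Clock+ is taken as zero, and the function name `next_clock_rise` is hypothetical.

```python
def next_clock_rise(prev_rise, period, stretch_fall=None):
    """Time of the next Clock+ edge.

    prev_rise    -- time of the previous Clock+ edge
    period       -- nominal (unstretched) clock period
    stretch_fall -- time at which Stretch falls, or None if Stretch
                    was never raised in this cycle
    """
    nominal = prev_rise + period
    if stretch_fall is None:
        return nominal
    # Stretch can only delay the edge; it can never shorten the period.
    return max(nominal, stretch_fall)
```

With a period of 10, a handshake that finishes at time 4 leaves the next edge at 10, while one that finishes at time 25 moves the edge out to 25.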

The clock is only stretched once to complete a full four-phase handshake on all 
selected input and output ports. This is possible because each port is actually 
an asynchronous finite state machine (AFSM) that can execute a complete 
handshake without directly using the clock. Data can therefore be exchanged on 
all ports on every clock cycle if desired. This is a considerable advantage of 
our method as compared to others such as the Pausible Clock Control circuit 
(Yun, Donohue, 1996), which requires two clock cycles for each four-phase 
handshake and can only respond to one port per cycle. For a circuit with N 
ports, their scheme requires 2×N clock cycles to exchange data on all ports, 
whereas our scheme can do the same in just one cycle. 

Figure 1 LS module with asynchronous wrapper 

Stretchable Clock 

The stretchable clock required for this wrapper specification must produce a 
clock period that is consistently longer than the worst-case combinational 
logic delay. Stretch+ must arrive well before Clock+ would normally occur to 
successfully stretch the clock. Stretch must not rise at about the same time as 
Clock+ because disastrous glitches could occur on Clock. Stretch can only 
prevent the clock from rising and will not affect Clock outside of a narrow 
interval around Clock+. The circuit shown in Figure 2, an adaptation of a clock 
from Seitz (1980), is a very simple implementation that will satisfy these 
requirements. 

Figure 2 Stretchable Clock 

The stretchable clock must fire all of the internal registers in each clock 
cycle. To increase the drive strength of the clock to support a large fanout we 
use an inverter chain where each stage has a fixed output load to input load 
ratio. The optimal ratio to minimise delay is e (Seitz, 1980). This loading 
will delay the arrival of Clock+ by some loading value Δ(# loads). Assuming 
properly sized clock buffers, Δ is approximately proportional to the natural 
logarithm of the number of loads. For 32 internal registers, the clock buffer 
delay is about 10 minimum inverter delays; for 1000 registers, the delay is 
nearly 20 inverter delays. This delay is a limiting factor in how fast the 
clock can respond to a change in the Stretch signal. One challenge we would 
like to meet is to effectively use this clock delay time for computation. 
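
The two delay figures quoted above are consistent with a simple first-order model of a tapered buffer chain. The sketch below rests on assumptions not stated in the text: a stage driving a load `stage_ratio` times its own size is taken to cost `stage_ratio` minimum-inverter delays, and a fractional number of stages is allowed. The function name `buffer_delay` is hypothetical.

```python
import math


def buffer_delay(loads, stage_ratio=math.e):
    """Approximate buffer delay, in minimum-inverter delays, for `loads` loads."""
    stages = math.log(loads) / math.log(stage_ratio)   # chain length, ~ln(loads)
    return stages * stage_ratio                        # assumed cost per stage


for n in (32, 1000):
    print(n, round(buffer_delay(n), 1))
```

Under these assumptions the model gives about 9.4 inverter delays for 32 loads and about 18.8 for 1000 loads, in line with the "about 10" and "nearly 20" quoted in the text.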

We will call the clock signal that reaches internal registers Clock, while the 
unbuffered output at the stretchable clock source will be called Clock_src. 



Input Port 

Most applications require input ports which wait for the environment to 
provide an input request. This is called a passive input port (van Berkel, 
1993), in contrast to an active input port, which initiates a request and waits 
for data to be returned. 

Figure 3 Passive Input Port Specification. The equations given in the figure 
are: 

AckI = ReqI·StretchI + Input·AckI + ¬LatchI·AckI 

StretchI = ReqI·Input + Input·¬AckI + LatchI·StretchI 

LatchI = ¬ReqI·AckI 

While we have implementations for both passive and active input and output ports 
(Bormann, Cheung, 19%), we only present the passive input port used in the DFT 
here. 

A ReqI+ transition may arrive at a passive input port at any time. The request 
will not be acknowledged, however, until the LS module asserts Input+. Since we 
are using a late data-valid bundling convention, the data is not guaranteed to be 
valid until after ReqI-. Therefore LatchI+, the transition which triggers the input 
latches, must not occur until after ReqI-. Furthermore, we must be certain that the 
time from when the input data is latched to the next Clock+ is greater than the 
combinational logic delay of the first stage of logic. Finally, AckI- must not occur 
before the data is latched since after AckI- the data can change at any time. 

The extended-burst-mode specification in Figure 3 will satisfy our requirements. 
Recall that an extended-burst-mode circuit waits for all transitions in the input 
burst to occur before sending the output burst, but that the input transitions may 
arrive in any order (Yun, Dill, 1993). A # symbol indicates that the signal is a 
directed don’t-care; it may either remain at 0, monotonically change from 0 to 1, or 
remain at 1. Similarly, a ~ means that the signal may remain at 1, monotonically 
change from 1 to 0, or remain at 0. 

To ensure correct behaviour of the extended-burst-mode circuit, we must meet 
three conditions (Yun, Dill, 1993). The fundamental-mode environmental 
constraint requires that a new input burst must not begin until the machine has 
stabilised. This can be met as long as the environment does not respond to an 
output burst faster than the internal state machine can recover. The feedback delay 
requirement can be satisfied by the synthesis tool using the conservative 
unbounded wire delay model. We do not have to worry about the setup time 
requirement because we are not using any conditional signals in our specifications. 

Note that the problem of the clock buffer delay mentioned above also affects the 
input and output ports. The LatchI signal will be delayed by Δ(# input bits) which, 
for 32-bit data paths, is about 10 gate delays. Thus the data will be waiting at the 
input latch for at least this amount of time; it may wait longer if ReqI arrives before 
Input. Since we know that the data will be valid for at least this amount of time, it 
is possible to insert combinational logic before the input port. If there are several 
input ports and we would like to compute a function of multiple DataI's, we can 
still safely insert Δ(# input bits) of logic because the clock will not stop stretching 
until all StretchI signals have fallen. 

Output Port 

Since the active output port initiates the handshake, ReqO+ can occur 
immediately in response to Output+. Output+ will also fire the output latches 
after Δ(# output bits). Data must be valid before LatchO+, but since we know 
the minimum delay from Output+ to LatchO+, we can use this time for 
computation as with the input port. 

Figure 4 Active Output Port Specification 

We would ideally like to have the internal logic stages balanced to have the same 
maximum combinational delay. Since the delay from Output+ to LatchO+ is fixed, 
we can delay Output+ so that LatchO+ lines up with the buffered Clock+ signal. 
Then the output stage can have the same amount of combinational logic as internal 
stages and the full clock cycle can be used. 

To synchronise Clock+ and LatchO+ we can wait to begin the output cycle until 
Clock_src+, the rising edge at the unbuffered clock source. Clock+ always 
follows Clock_src+ by Δ(# loads), so we cannot prevent Clock+ from sending new 
data to the output port. The output port will not allow the current clock cycle 
to complete until a full handshake completes. Therefore there is no danger of 
losing the current output data value because there is nothing that can prevent 
LatchO+, since the previous handshake has definitely completed when Clock_src+ 
occurs. 

The equations given with the active output port specification are: 

ReqO = ¬AckO·ReqO + Output·¬LatchO·¬Y0 

StretchO = AckO + ReqO + Output·¬LatchO·¬Y0 

Y0 = AckO·LatchO + Output·Y0 

LatchO = Output·¬Y0 

Note that the port may produce valid data well in advance of ReqO- if the 
environment is slow to AckO+. This is acceptable using the late data-valid 
bundling convention; we only need to ensure that data is valid before ReqO- and 
does not change until after AckO-. 

Port Control 

For each clock cycle the port control is responsible for generating the correct set of 
Input and Output port commands. If an input handshake is to be performed in a 
cycle, Input+ occurs in response to Clock-. Output commands will usually be 
generated after Clock+ as discussed above. The corresponding port’s Stretch signal 
will rise directly in response to the command and then fall when the handshake has 
been completed. The falling Stretch signal from each port clears the Input or 
Output command from the port control to prepare for the next cycle. 

Some LS modules may perform a full set of input and output handshakes on 
every cycle. The sliding window DFT algorithm, however, only needs to execute 
each input and output handshake every m cycles where m is the desired number of 
bits of accuracy in the result. Coming out of reset the DFT will require input data 
to begin computation. Thus each stage will be stretched until a request arrives on 
that stage’s input port. After m cycles, a new input vector is loaded and the old 
vector is output to the next DFT stage. After another k cycles, the scaled vector is 
sent out through the lower output port. Each of these handshakes will only occur 
every m cycles. 

Eliminating Metastability 

An essential requirement for any locally-synchronous circuit is to eliminate the 
possibility of failure due to metastability. In this section, we demonstrate how this 
circuit avoids metastability and the requirements to exclude this possibility in any 
locally-synchronous system. 

Whenever an asynchronous input is sampled by a synchronous system, there is a 
chance that the latched value may remain metastable for an unbounded period of 
time (Chaney, Molnar, 1973). This is because the input may be changing just as it 
is sampled and thus can be neither high nor low. 

Totally synchronous systems can be designed to reduce the risk of failure to an 
acceptable level, but can never eliminate the possibility entirely. One method 
employed is to use multiple stages of buffers on asynchronous inputs, seeking to 
prevent a metastable signal from propagating through all the stages and into the 
internal logic. Another solution is simply to slow down the clock, since the 
probability of failure reduces as the clock period increases. These methods increase 
circuit area and latency in the first instance and decrease throughput in the second. 
These costs go towards decreasing the risk, but never remove the threat entirely. 
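
For reference, the trade-off described above is usually captured by the standard first-order model of synchroniser reliability. This is a textbook expression rather than a result of this paper, and the symbols are introduced here only for illustration: t_r is the time allowed for a metastable value to resolve, τ is the resolution time constant of the sampling element, T_0 is its metastability window, f_c is the clock frequency and f_d is the rate of asynchronous input transitions.

```latex
\[
\mathrm{MTBF} \;=\; \frac{e^{\,t_r/\tau}}{T_0 \, f_c \, f_d}
\]
```

Adding buffer stages or slowing the clock increases t_r, so the MTBF grows exponentially, but it remains finite for any finite t_r. This is why such measures reduce the risk without ever removing it.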

Locally synchronous systems are a fundamental solution to the problem in that 
they remove the risk of metastability altogether (Pechoucek, 1976). The MTBF 
due to metastability increases to infinity, and we do not have to reduce our 
performance to reach the goal. 

The key to solving this problem is in the use of the stretchable clock. Recall that 
during most of the clock cycle, Clock is unaffected by any changes to Stretch. It is 
only when the clock delay has expired and Clock+ is about to occur that the clock 
can be prevented from rising until Stretch falls. Any glitches or noise on the 
Stretch signal are filtered out by the stretchable clock during the remainder of the 
cycle. Even if we constructed a circuit where both the Stretch+ and Stretch- 
conditions could become true simultaneously, briefly causing Stretch to remain at 
an undefined value, the clock would be unaffected as long as this does not occur 
within a narrow time around Clock+. Therefore as long as we can restrict when 
Stretch+ can occur, the rest of the circuit will be unaffected. The asynchronous 
wrapper only asserts Stretch+ in response to the port select signals, and since these 
only occur early in a clock cycle, anomalous behaviour will not occur on the clock 
signal. 

Any time a metastability-free locally synchronous circuit needs new input from 
the environment, the clock must stretch until the data is available. The LS circuit 
must commit to sample an input by raising the Stretch signal. We cannot construct 
a circuit that 'polls' for an asynchronous input using this technique. While we 
prefer metastability-free asynchronous wrapper implementations, there is no 
reason that we cannot use a Q-module type polling input port if it is required 
(Rosenberger et al., 1988). We call this element a Q-Port (Bormann, Cheung, 
1997). 



Figure 5 Systolic locally synchronous sliding window DFT implementation (N=4) 



4 LOCALLY SYNCHRONOUS DFT IMPLEMENTATION 

The N-element sliding window DFT algorithm we have implemented is composed of 
N Coordinate Rotation Digital Computer (CORDIC) stages. Each stage has its own 
stretchable clock and requires m clock cycles, one cycle for each bit of the 
input operands, to complete computation. The CORDIC algorithm requires that the 
resultant vector must be scaled after rotation, consuming an extra k cycles. 
However, since subsequent stages do not depend on the scaled vector, only the 
latency but not the throughput is affected by scaling. The input operands are 
unchanged from one stage to the next, and are passed after m cycles. Therefore 
the total time for computation of the first complete DFT is N×m+k cycles, but 
subsequent updated windows will be available after only m cycles. 

Figure 6 Asynchronous wrapper surrounding locally synchronous DFT stage 
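
As a worked example of the latency expression, take N = 4 as in Figure 5 and assume, purely for illustration, m = 32 bits of accuracy with k = m/4 = 8 scaling cycles (the m/4 estimate is the one quoted in Section 2; the value m = 32 is an assumption made here).

```latex
\[
T_{\text{first}} = N \times m + k = 4 \times 32 + 8 = 136 \ \text{cycles},
\qquad
T_{\text{update}} = m = 32 \ \text{cycles}
\]
```

With these assumed values, scaling adds 8 cycles to the latency of the first result but does not affect the 32-cycle interval between subsequent windows.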

Initially, all stages of the DFT are inactive and consume no power. When data 
enters the first stage from the left, the input port will signal the clock control to 
cease stretching and computation begins. For each of the m cycles of the CORDIC 
rotation, the clock is never stretched since no handshaking is required. After m 
clock cycles, the bl and b2 inputs are passed, unchanged, through the output port 
to the next stage. At this point the CORDIC rotation has completed, so new data is 
needed to continue. The clock may stretch whilst waiting for a new set of inputs 
from the left. As soon as new input data has arrived, the clock will resume and k 
cycles later, the scaled DFT vector will be passed through the lower output port. 



5 EVALUATION AND LIMITATIONS 

In this section we analyse the overhead an asynchronous wrapper imposes on a 
locally synchronous circuit. We evaluate the area penalties of the wrapper and the 
maximum clock speed of a locally synchronous region, as well as comparing the 
power consumption to a fully synchronous implementation. 

Area Overheads 

Each locally synchronous region must be surrounded by its own asynchronous 
wrapper. The wrapper consists of a local clock, a clock buffer, and a collection of 
input and output ports. 

Each port has a set of latches and one AFSM. The input port AFSM requires just 
nine 2- and 3-input gates. The output port AFSM is also made up of only nine gates. 
The latches would be required even if the circuit was implemented with a normal 
fully synchronous methodology, so we do not include them as a wrapper overhead. 
Therefore the number of gates added by the ports is just nine times the total 
number of input and output ports. 

This does not include the buffer circuit to increase the drive strength of the Latch 
signal to drive the port latches. This area would normally be included in the global 
clock buffer circuit which is unnecessary in a LS circuit. For a 32-bit port, a fast 
buffer will have 3 inverter stages. Each inverter is e times larger than the last, so 
the inverters will have sizes e, e^2, and e^3 (≈20) times the minimum size inverter. 

Since the number of buffer stages only increases with the log of the number of 
loads, the clock buffer will only require about one more stage than the Latch signal 
buffer for each CORDIC DFT block. 

Buffer Delay 

For 32-bit data, the input buffer delay will be approximately 10 inverter delays 
from ReqI- to LatchI+. Since we have made no effort to put combinational logic 
outside of the input ports for this design, this time is actually wasted on each input 
to the LS DFT. However, we only have to pay this penalty when data is input into 
the LS circuit. Since new data is only read into the LS circuit every m cycles (m = 
# of bits) for a CORDIC rotation, the overhead is about 0.3 inverter delays per 
clock cycle. The total latency of the DFT will increase by the input buffer delay 
times the sliding window size N. 
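
The amortised figure can be reproduced directly. The value m = 32 below is an assumption (the text gives only "m = # of bits" alongside 32-bit data), and N = 4 is taken from the Figure 5 example; Δ(in) denotes the 10-inverter input buffer delay.

```latex
\[
\frac{\Delta(\text{in})}{m} = \frac{10}{32} \approx 0.31 \ \text{inverter delays per clock cycle},
\qquad
N \times \Delta(\text{in}) = 4 \times 10 = 40 \ \text{inverter delays of added latency}
\]
```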

Power 

A locally synchronous circuit will not consume power when there is no data 
available for processing. This is also possible in fully synchronous designs with the 
use of clock gating. However, unlike a clock gated synchronous circuit, the LS 
implementation will not even consume power in the clock or clock buffers when 
inactive. This may be a significant savings in high speed designs. 

The Pausible Clock Control circuit proposed by Yun and Donohue (1996) 
samples the handshake signals instead of stopping and waiting for a request to 
arrive. Thus even when there is no data available for processing, their circuit will 
consume the same amount of power. 

Maximum Clock Speed 

When determining the maximum clock speed for a locally synchronous circuit it is 
necessary to take into account not just the worst-case combinational logic delays 
but also any delays imposed by the asynchronous wrapper. Note that it is possible 
to have a nominal clock period that is shorter than the period for cycles in which 
handshaking occurs. However, we expect that most circuits will exchange data 
with the environment in a high proportion of clock cycles. If that is the case, most 
clock cycles will be longer than the worst-case combinational logic delay, thus 
wasting computation time unnecessarily. 

For every clock cycle that handshakes with the environment, the maximum clock
period will be determined by the time it takes to complete the handshake plus the
delay imposed by the clock buffer. For the moment we assume that the output port
of one LS module connects directly to the input port of the next. For an input
cycle, the delay from Input+ to Stretch- can be as little as 7 gate delays. For an
output cycle, the current implementation of the output port requires that StretchO
is high for the entire handshake. This includes the delay introduced by the LatchO
buffer as well as the time required by the receiving input port to latch the data.
The resulting delay from Output+ to Stretch- is summarised in Table 1.



Table 1 Time from the start of an I/O cycle to Stretch-

Passive Input Port Delay            Active Output Port Delay

  Input+ -> AckI+     = 2             Output+ -> ReqO+    = 2
                                      ReqO+ -> AckO+      = 2
                                      AckO+ -> ReqO-      = Δ(out) - 4
  AckI+ -> ReqI-      = 2             ReqO- -> AckO-      = Δ(in)
+ ReqI- -> Stretch-   = 3           + AckO- -> Stretch-   = 3

                      = 7 gate delays                     = 23 gate delays



This delay is 23 gate delays assuming the loading delay for both the input and 
output ports is 10 gate delays. We are making the conservative assumption that the 
inverter delays in the buffers are as long as the gate delays of the AFSM. The clock 
period must include the clock buffer delay as well, so the period in this case would 
be approximately 33 gate delays. 
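
The figures quoted above can be re-derived with a few lines of arithmetic. The sketch below simply adds up the entries of Table 1; the function names are hypothetical, and the clock buffer delay of 10 gate delays is an assumption inferred from the difference between the 33 and 23 gate delays given in the text.

```python
def input_cycle_delay():
    """Gate delays from Input+ to Stretch- (passive input port)."""
    # Input+ -> AckI+, AckI+ -> ReqI-, ReqI- -> Stretch-
    return 2 + 2 + 3

def output_cycle_delay(load_out=10, load_in=10):
    """Gate delays from Output+ to Stretch- (active output port)."""
    return (2                 # Output+ -> ReqO+
            + 2               # ReqO+  -> AckO+
            + (load_out - 4)  # AckO+  -> ReqO-
            + load_in         # ReqO-  -> AckO-
            + 3)              # AckO-  -> Stretch-

# Assumed: the clock buffer is as slow as the port loading delay.
CLOCK_BUFFER_DELAY = 10

period = output_cycle_delay() + CLOCK_BUFFER_DELAY
print(input_cycle_delay(), output_cycle_delay(), period)  # prints 7 23 33
```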

Note that with our asynchronous wrapper, the number of ports per LS module only
affects the clock speed by increasing the fan-in of the OR-gate driving Stretch.
This is in sharp contrast with the Pausible Clock Control method (Yun, Donohue,
1996), which requires a tree of arbiters to merge all incoming handshake signals.
As well as adding the area of one arbiter per port, this will increase the response
time of the circuit. Using the arbiter delay given in their paper, the delay from an
incoming request to an acknowledge will increase by approximately 1 ns per port
for both the Req+ -> Ack+ and the return-to-zero phase of each handshake.

Timing Assumptions 

Locally synchronous systems strongly rely on bounded-delay assumptions about 
circuit elements. First we have assumed that a bundled data protocol can be met. 
More critically, we are relying on a circuit delay element to produce a clock period 
which is consistently long enough to allow combinational worst-case timing 
requirements to be met. 

Since these are both one-sided requirements, they should not prove difficult to 
satisfy. Furthermore, if worst-case combinational delays are unacceptable for some 
part of a design, this methodology allows asynchronous completion detection logic 
to stretch the clock if necessary. 

We cannot allow Stretch+ to occur within about one gate delay of a Clock edge. If
Stretch+ is allowed to happen during this period it is possible for the clock to
glitch, causing the circuit to fail. This requirement is also not difficult to
satisfy: Input+ and Output+ are fired in response to Clock transitions, and Stretch+
takes three gate delays from either Input+ or Output+, and thus will never cause
glitches on the clock.

6 CONCLUSIONS 

Some excellent work has been published on locally synchronous circuits in the 
past, most notably in Chapiro’s 1984 thesis. Using Chapiro’s “Escapement” 
circuits or the more recent Pausible Clock Control circuit (Yun, Donohue, 1996), 
however, it is necessary to stretch for each incoming handshake transition. This 
means that a four-phase handshake protocol needs to stretch the clock twice to 
complete, requiring a minimum of two full clock cycles. Furthermore, even when
using two-phase handshaking, Escapement systems need to wait for all requests to
arrive before acknowledging any of them. Pausible Clock Control circuits have the
additional penalty of requiring arbitration between all incoming handshake signals 
and can only respond to one port request per clock cycle. 

We have overcome these concurrency problems through the use of asynchronous 
state machines. Using extended-burst-mode specifications, we were able to design 
modules to complete asynchronous handshaking outside of the clock. This ensures 
that the clock only needs to run when actual computation is taking place, and that 
the clock does not stretch unnecessarily. 

Q-modules are another alternative for globally asynchronous locally synchronous
design (Rosenberger et al., 1988). This method is similar to that used by the
Pausible Clock Control circuits. We prefer our method for several reasons. First,
we do not require specially designed circuits for detecting metastability after
latching signals. Second, since Q-modules rely exclusively on polling external
signals, they cannot respond directly to handshake requests and may require up to
one full clock cycle to realise that a new request has arrived. Finally, this type of
busy-wait behaviour consumes power unnecessarily, the internal clock running at
full speed even when no new data is available.

Globally asynchronous locally synchronous design offers many of the
advantages typically cited for totally asynchronous circuits. The clock signal does
not need to be distributed globally, and thus clock skew becomes much less of a 
problem. Furthermore, by using request-acknowledge signalling instead of relying 
on global timing constraints, these circuits are more modular and composable than 
totally synchronous designs. 

LS blocks only consume power when it is needed, stretching the clock at all 
other times. However, rather than the block itself waiting for an input as in 
asynchronous logic, it is the clock that is waiting. This means that unlike 
synchronous techniques such as clock gating, even the clock will not be drawing 
power when the circuit is inactive. 

Asynchronous circuits are often claimed to be faster than their synchronous 
counterparts. This advantage is gained in part because synchronous clocks need to 
be slowed down to accommodate worst case timing constraints. Worst case timing 
incorporates factors such as clock skew, environmental sensitivity, process 
variations, and completion time variance. These locally synchronous systems 
suffer less from clock skew since the clock is less widely distributed. Furthermore, 
since the local clocks are fabricated within the LS region, the clock speed will 
track environmental and process variations (Dean, 1992). Finally, the use of 
request/acknowledge signalling allows completion detection and reduced clock 
periods for quickly completing computations. 

Using VHDL we have been able to quickly design a circuit which can be 
optimised using the best industry standard tools available. We are able to leverage 
the synchronous design techniques that have been developed over the past 50 
years, and yet still gain many asynchronous benefits. Industry may be more 
receptive to the asynchronous design world when they can preserve so much of 
their past investment. Finally, locally synchronous design offers the possibility of 
combining asynchronous and synchronous circuits simply and elegantly in a single 
system. This extends the composability of asynchronous circuits into the 
synchronous domain. 
Abstract

The CORDIC algorithm has been widely used as a powerful and flexible generic 
architecture to implement many algorithms involving non-trivial arithmetic. 
However, when using its fastest, i.e. unfolded, implementation it exhibits 
excessive silicon area demands. Exploiting some peculiarities of the algorithm 
permits simpler hardware structures in the unfolded case yielding a substantial 
area reduction. Thus, power consumption is decreased, too. These reductions 
have no speed penalty. The benefits of the improved architecture have been 
verified by developing VHDL models and synthesizing sample layouts for 
comparison purposes. 



Keywords 
Digital signal processing, Low Power Design, Parallel architectures



1. INTRODUCTION 

For several applications CORDIC processing units have been shown to deliver 
superior performance when compared with more conventional approaches. This is 
due to the fact that many advanced algorithms can be interpreted as generalized 
vector rotations, for which CORDIC is especially suited.

Considering the implementation part of signal processing systems, the main 
shortcoming of CORDIC-based pipeline or array architectures, i.e. so called 
unfolded architectures, is their increased hardware complexity. In this paper we 
address CORDIC's hardware complexity and describe architectures with
considerably reduced chip area requirements. It will be shown that this area 
shrinking is obtained with no loss in speed. 

2. REDUNDANT CORDIC ALGORITHMS



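The CORDIC algorithm is defined by recurrences in three coordinates /1/:

$$x_{i+1} = x_i - m\,\sigma_i\,2^{-S(m,i)}\,y_i \tag{1}$$

$$y_{i+1} = y_i + \sigma_i\,2^{-S(m,i)}\,x_i \tag{2}$$

$$z_{i+1} = z_i - \sigma_i\,\alpha_{m,i} \tag{3}$$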



where m denotes the coordinate system (m = 1 means circular, m = 0 linear, m = -1
hyperbolic), S(m,i) the shift sequence with S(0,i) = S(1,i) = 0,1,...,n and S(-1,i)
= 1,2,3,4,4,5,...,3j,3j+1,3j+1,3j+2,...,n. $\alpha_{m,i}$ is called the rotation
angle and $\sigma_i$ denotes the rotation direction, usually $\sigma_i \in \{-1,1\}$
in non-redundant number systems and $\sigma_i \in \{-1,0,1\}$ in redundant systems
(but $\sigma_i = 0$ has to be avoided, cf. below). The precision in bit is given by
n while N means the number of iterations, with N = n for m = 0 and m = 1 (for
m = -1 some extra iterations are necessary, see S(m,i)). The rotation angle depends
on S(m,i) according to

$$\alpha_{m,i} = \begin{cases} \tanh^{-1}\left(2^{-S(m,i)}\right) & \text{for } m = -1 \\ \tan^{-1}\left(2^{-S(m,i)}\right) & \text{for } m = 1 \\ 2^{-S(m,i)} & \text{for } m = 0 \end{cases} \tag{4}$$

Two operational modes are possible, rotation or vectoring. The rotation direction
factor $\sigma_i$ is determined by the following equation:

$$\sigma_i = \begin{cases} \operatorname{sign}(z_i) & \text{for } z_n \to 0 \text{ (i.e. rotation)} \\ -\operatorname{sign}(x_i\,y_i) & \text{for } y_n \to 0 \text{ (i.e. vectoring)} \end{cases} \tag{5}$$

where sign(A) = 1 for A > 0, else sign(A) = -1. The algorithm converges for all
input data inside the region of convergence, given by

$$\sum_{i=0}^{N-1} \alpha_{m,i} \ge \begin{cases} \left|z_0\right| & \text{for } z_n \to 0 \\ \left|\frac{1}{\sqrt{m}}\tan^{-1}\left(\sqrt{m}\,\frac{y_0}{x_0}\right)\right| & \text{for } y_n \to 0 \end{cases} \tag{6}$$

The solution of the recurrences (1-3) is given in Table I with

$$K_m = \prod_{i=0}^{N-1} \left(1 + m\,\sigma_i^2\,2^{-2S(m,i)}\right)^{1/2} \tag{7}$$

being the scaling factor and $x_0$, $y_0$, and $z_0$ the starting values of the iterations.
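
As an illustration of the recurrences and of the scaling factor, the following sketch runs the circular case (m = 1, S(1,i) = i) in rotation mode. It is a floating-point model written for this purpose only, not the redundant fixed-point datapath discussed in this paper; the function name and the iteration count are assumptions.

```python
import math

def cordic_rotate(x0, y0, z0, n=24):
    """Circular CORDIC (m = 1) in rotation mode: drive z towards 0."""
    x, y, z = x0, y0, z0
    k = 1.0                                  # accumulated scaling factor
    for i in range(n):                       # shift sequence S(1, i) = i
        sigma = 1.0 if z > 0 else -1.0       # sign(A) = 1 for A > 0, else -1
        shift = 2.0 ** -i
        # Both updates use the old x and y, as in the recurrences.
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * math.atan(shift)        # rotation angle for m = 1
        k *= math.sqrt(1.0 + shift * shift)  # sigma**2 == 1 here
    return x, y, z, k

x_n, y_n, z_n, k = cordic_rotate(1.0, 0.5, 0.5)
# Closed-form result for m = 1 in rotation mode.
expected_x = k * (1.0 * math.cos(0.5) - 0.5 * math.sin(0.5))
expected_y = k * (0.5 * math.cos(0.5) + 1.0 * math.sin(0.5))
print(abs(x_n - expected_x), abs(y_n - expected_y), abs(z_n))
```

Because $\sigma_i$ is never zero in this non-redundant model, the scaling factor is the same constant for every input, which is the property the redundant schemes discussed next have to preserve.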

Equation (5) is only valid for non-redundant addition schemes. As these are
carry-dependent and, consequently, inherently slow, some recently described
proposals use redundant adders like carry-save /2/ or redundant binary /3,4,5,6/
adders. However, while delivering superior speed due to their carry-independent
nature, redundant addition schemes introduce certain algorithmic difficulties. The
sign of a number can not be easily derived from the sign bit in the leftmost bit
position as in the two's complement number system. In fact, in redundant
schemes the most significant digit could equal zero. This would suggest choosing
$\sigma_i = 0$, but this choice is commonly prohibited because it inevitably varies
the otherwise fixed scaling factor, thus losing all benefits of easy scaling
factor compensation /6/. So further bits, at worst all bits, would have to be
inspected for sign evaluation, losing the speed advantage.

To overcome this conflict most authors /2,3,4,6/ utilize the approach found in fast
SRT-division /7/: some most significant digits of the number are inspected and
this estimate is used to determine an appropriate $\sigma_i$. Provided the absolute
value of the estimate is above a specific margin, $\sigma_i$ is set to +1 or -1,
respectively, otherwise to 0. This choice can disturb the convergence behavior of
the algorithm due to a possibly wrong rotation direction. As has been shown in
/2,3,4,6/ this can be compensated by simple iteration doubling after a specific
number of microrotations.

The number of iteration doublings linearly depends on the number of inspected
digits /3,4,6/. For p inspected digits each (p-1)th iteration has to be repeated.

An alternative approach to keep the scaling factor constant is given in /8/ by
using a reformulation of the CORDIC algorithm and computing absolute values.
This idea is an adaptation of an algorithm first described to speed up the
add-compare-select loop in Viterbi decoders. Unfortunately, it nearly doubles the
amount of registers in the pipeline, thus significantly increasing chip area and
latency. Therefore, this idea can not be used when chip area is the primary issue.
In the following we build upon the iteration doubling method /2,3,4,6/.

Due to possible finite word length effects, angle errors and overflow prevention,
we need a unified machine word length of $w_{xy} = n + g_{xy} + o_{xy} + 1 = n + \log_2(n) + 3$ bit
for the x and y datapath and $w_z = n + g_z + o_z + 1 = n + \log_2(n/3) + 4$ bit for the z
datapath. A more detailed discussion of CORDIC's inherent quantization errors can
be found in /9/.





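            z_n -> 0 (rotation)                            y_n -> 0 (vectoring)

m = -1      x_n = K_{-1} (x_0 cosh(z_0) + y_0 sinh(z_0))   x_n = K_{-1} sqrt(x_0^2 - y_0^2)
            y_n = K_{-1} (y_0 cosh(z_0) + x_0 sinh(z_0))   z_n = z_0 + tanh^{-1}(y_0/x_0)

m = 0       x_n = x_0                                      x_n = x_0
            y_n = y_0 + x_0 z_0                            z_n = z_0 + y_0/x_0

m = 1       x_n = K_1 (x_0 cos(z_0) - y_0 sin(z_0))        x_n = K_1 sqrt(x_0^2 + y_0^2)
            y_n = K_1 (y_0 cos(z_0) + x_0 sin(z_0))        z_n = z_0 + tan^{-1}(y_0/x_0)

Table I: CORDIC functions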

3. SILICON AREA CONSIDERATIONS

A typical architecture that implements the recurrences (1-3) in an unfolded
spatial array manner maps each iteration to one row of the array. As we assume N
= n iterations and three datapaths with an internal word length of at least n bit
(see above), the area complexity of a CORDIC array is proportional to 3n^2. As an
example, the required silicon area of the chip described in /10/, an IEEE-754
single precision floating point CORDIC pipeline implemented with non-redundant
adders, exceeds 150 mm^2 in a 1.5 µm CMOS technology. Two thirds
of this area are devoted to the iteration execution. Other published
implementations with similar hardware efforts include /11,12,13/. Obviously, any
reduction in chip area due to algorithmic and architectural modifications would
improve yield and decrease production cost. In addition it reduces the power
consumption, which is considered to be of growing significance in wireless
applications. The motivation for this work is, therefore, to reduce the chip area to
allow the application of CORDIC array and pipeline architectures in die size
critical areas, e.g. embedded control, but with no speed penalty.

Previous work on area reduction has focused on different parts of the algorithm. 
At first, many researchers investigated methods to reduce the hardware amount 
necessary for scaling factor compensation by incorporating the scaling into the 
iteration or by optimizing the number of scaling iterations /15/.

Further research concentrated on the optimization of constant scale factor
algorithms emerging from the application of redundant adders, as mentioned
before. Subsequently, some authors tried to reduce the number of iterations, e.g.
/5/ by applying a prediction scheme in the rotation mode resulting in a z-path
reduction to roughly one third. This method can also be extended to the vectoring
mode, as has been demonstrated in /16/. Thus the hardware requirements of the
z-path decrease considerably for rotation and vectoring mode.

The prediction of some $\sigma_i$ in the rotation mode /5/ allows for partly recoding
$\sigma_i$ so that two iterations in x and y can be multiplexed into one by selecting
one of two different shifts. Recently, this concept has been generalized to vectoring
by adapting radix-4 SRT-division operand-prescaling /17/, yielding a unified mixed
radix 2-4 architecture with n/4-1 fewer microrotations than a pure radix-2
architecture /6/.

One observation which can be made is that the aforementioned modifications
trade algorithmic and architectural simplicity of CORDIC for an overall
reduction of microrotations. Therefore, for non-redundant addition schemes the
impact of decreasing magnitudes of $x_i$, $y_i$ and $z_i$ in vectoring and rotation
mode, respectively, has been investigated in /16/, yielding a decrease in the widths
of the corresponding datapaths and, consequently, lower silicon area. In the
following, we investigate whether this approach can be extended to the redundant
case, too.



4. AREA REDUCED ARCHITECTURES 

4.1 Rotation mode x- and y- datapath 

Fig. 1: Section of datapath for standard x-iterations

Figure 1 depicts the usual way to implement the x iterations (shown for m=1). The
iterations for y are implemented correspondingly.

Here we chose 4-2 redundant binary adders for implementation. They can be
realised in static CMOS circuit technique as given in /18/. The situation for
carry-save adders is about the same. It should be noted that appropriate measures
have to be taken to avoid pseudo overflows due to the redundant representation.
This can be readily incorporated into the 4-2 redundant binary adder cell /19/ and
will not be indicated here for the sake of simplicity. For the same reason we do not
consider any iteration repetition necessary for guaranteeing a constant scaling
factor. Starting with the second iteration and the most significant digit, an
increasing number of zeros has to be added to the corresponding digits of $x_i$ due
to the shift sequence.

In these positions, it is sufficient for proper iteration results to take into account
the previous value of $x_i$ ($y_i$ in the y-datapath) and the transfer digit from the
next lower significant digit position. Hence, a much simpler redundant addition cell
can be employed. Fig. 2 demonstrates the corresponding architecture. We refer to
this adder as a 2-2 redundant binary adder (RBA). Note that the simplification is
not limited to a simpler addition with one fixed addend: the sign inverting logic is
no longer needed in these positions either.



RR = Add two redundant binary digits, bit position index not shown

Fig. 2: Reduced datapath for standard x-iterations

Thus the circuit for one digit position reduces from a 4-2 RBA, sign inverting
logic, and two latches to one 2-2 RBA and two latches. In Fig. 3 a fully static
CMOS version of a 4-2 RBA requiring 42 transistors is shown /18/. The redundant
binary numbers are coded in sign and amplitude (value) format, indicated by
indices s and a, respectively. The additional steering and inverting logic amounts
to one inverter and one 2-1 multiplexer, adding at least 6 transistors when
employing transmission gate logic. In summary, we need 48 transistors. The
signals p and q represent the internal transfer digit. Dotted areas indicate
AOI-gates.

Fig. 3: Static CMOS 4-2 RBA cell /18/

Fig. 4 exhibits a 3-2 RBA cell, which was reported in /18/. It requires 22
transistors. On the other hand, the 2-2 RBA can be directly derived from this
circuit, yielding the circuitry shown in Fig. 5, requiring only 14 transistors.

Fig. 4: CMOS 3-2 RBA cell /18/

Fig. 5: CMOS 2-2 RBA cell

4.2 Rotation mode z-datapath 

For redundant adders, the original z-iteration according to Eq. 3 is modified to
$z_{i+1} = 2\,(z_i - \sigma_i\,2^{S(m,i)}\,\alpha_{m,i})$ /4/. In this way, the critical
bits to be examined for sign estimation are fixed at the same position for all
iterations. While this is mandatory for recursive implementations it also improves
the regularity of the physical implementation in array and pipeline structures, as
shown in Fig. 6.



Fig. 6: Section of datapath for standard z-iterations

In this figure we identify the lower right triangle of adders (an RT-adder is a 3-2
RBA with sign steering logic) which do not contribute to the sign estimation, as
they are adding zeros. Based on this observation we can diagonally prune this
array of adders to the structure given in Fig. 7 with no loss in precision or speed.
Neglecting the hardware necessary for sign estimation, the gate count has been
roughly halved.

4.3 Vectoring mode x- and y-datapath 

For redundant adders, the original iterations for x and y according to Eqs. 1 and 2
are usually modified to

$$x_{i+1} = x_i - m\,\sigma_i\,2^{-2S(m,i)}\,y_i \tag{8}$$

$$y_{i+1} = 2\,(y_i + \sigma_i\,x_i) \tag{9}$$

Let us consider the x-path and take into account that the absolute value of $y_i$ is
kept constant to a specific degree by Eq. (9). Then it can be deduced that in the
x-path about the same situation occurs as for x in the rotation mode (Fig. 1).
However, with each iteration the amount of zeroes which have to be added to the
most significant digit positions increases by two. Thus, referring to Fig. 1, in each
iteration two (instead of one) more 4-2 RBA cells can be substituted by 2-2 RBA
cells (called RO) beginning from higher order digit positions. Thus, even more
hardware savings than in the rotation mode are possible. Furthermore, the
x-iteration can be halted after n/2+1 iterations as n-bit precision has been achieved
/16,20/.

Considering the y-path (Eq. 9), about the same situation as for the z-path in
rotation mode exists. However, 4-2 RBAs have to be employed. An increasing
number of zeroes can be added, starting from the least significant digit position.
The error resulting from not adding the whole length of $x_i$ and $y_i$ does not affect
the overall iteration as it is hidden in the guard digits. So we can omit the lower
right triangle of 4-2 RBA cells, yielding a reduction in gate count by one half of
the standard architecture.




RT = Add one redundant binary and one two's complement digit, bit position index not shown 

Fig. 7: Reduced datapath for standard z-iterations



4.4 Vectoring mode z-datapath 

The z-path and the corresponding iteration equation (3) resemble the x- and
y-path in rotation mode. That is, we add constantly decreasing values of the
rotation angle to $z_i$. The resulting architecture is similar to Fig. 2, yielding an
upper right triangle of 3-2 RBA cells (Fig. 4) and a lower left triangle of 2-2 RBA
cells (Fig. 5). Again, the hardware savings are not only due to the simpler adder
cells, but also caused by the partly dispensable sign steering logic.



5. EVALUATION 

As speed (latency as well as throughput) is not affected by our reductions we
concentrate on a discussion of the area requirements. It should be noted that the
described modifications are not only fully applicable to redundant constant scale
factor methods which employ correcting iterations /3,4,6/ but also partly
applicable to the "absolute value" method given in /8/. Therefore, as unmodified
references for comparison we assume the mixed radix 2-4 architecture /6/ and the
"absolute value" method /8/.

As our discussion is based on the assumption that we use correcting iterations we
have to establish a base for comparison. We assume that we assimilate enough
higher significant digits to guarantee that only each fourth iteration has to be
repeated. For a detailed discussion of this topic see, for example, /4,6/. Also, to
keep the comparison simple, we assume m = 1, no scaling factor compensation, no
sign steering logic, and no guard and overflow digits. The standard
architecture then requires 1.25n iterations with full length datapaths.

A 4-2 RBA, 3-2 RBA, and 2-2 RBA occupy an area equal to 2, 1, and 2/3 full
adders, respectively. This assumption can be justified by comparing the
transistor counts of the corresponding Figures 3, 4, and 5.
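
The comparison of transistor counts can be made explicit as follows. The sketch takes the 3-2 RBA as the unit (one equivalent full adder) and assumes that area scales with transistor count; both are simplifying assumptions, and the dictionary names are hypothetical.

```python
# Transistor counts quoted for the cells in Figs. 3, 4 and 5.
TRANSISTORS = {"4-2 RBA": 42, "3-2 RBA": 22, "2-2 RBA": 14}
# Area weights used in the evaluation, in equivalent full adders.
ASSUMED_AREA = {"4-2 RBA": 2.0, "3-2 RBA": 1.0, "2-2 RBA": 2.0 / 3.0}

reference = TRANSISTORS["3-2 RBA"]  # taken as one equivalent full adder
ratios = {cell: count / reference for cell, count in TRANSISTORS.items()}

for cell in TRANSISTORS:
    print(f"{cell}: {ratios[cell]:.2f} by transistor count, "
          f"{ASSUMED_AREA[cell]:.2f} assumed")
```

The ratios obtained this way lie within a few percent of the weights 2, 1 and 2/3 used in the remainder of the evaluation.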

The improved correcting iterations method /6/ and our architectures employ
$\sigma_i \in \{-1,1\}$ for $i < n/4$, $\sigma_i \in \{-1,0,1\}$ for $n/4 \le i \le n/2$ using the
approach given in /5/, which avoids correcting iterations, and radix-4 iterations
for $i > n/2$.

Our architectures are also supposed to use only one third of the z-path and a
parallel prediction of all further $\sigma_i$ in the rotation mode /16,21/. In vectoring
mode, as $\alpha_{m,i} \approx 2^{-S(m,i)}$ for $S(m,i) > n/3$, only a single summing of the bits is
necessary, reducing two thirds of the z-path to one redundant subtraction /16/.

5.1 Rotation mode 

The "absolute value" method needs 7.75n^2 latches and 5n^2 equivalent full adders
/8/. The mixed radix 2-4 architecture /6/ requires n/4 * 1.25 + n/4 + n/4 =
0.8125n iterations. In the first half of the iterations it uses three full length
datapaths, in the second half only two datapaths are necessary by using the
prediction method. Thus (3 * 2 * 2.25n/4 + 2 * 2 * n/4) * n = 4.375n^2 latches (note:
two latches per digit) and (2.25n/4 * (2*2+1) + n/4 * (2+1)) * n = 3.5625n^2
equivalent full adders are necessary.

Our architecture as depicted in Fig. 8 (r = 2.25n/4) has the same number of
iterations as the mixed radix approach. We need 2 * [2 * (2.25n/4 * n/2 * 3/2 + n/2
* n/4 * 1/2) + 2/3 * (2.25n/4 * n/2 * 1/2 + n/2 * n/4 * 3/2)] = 2.375n^2 equivalent
full adders for x and y. The z-datapath contains n/3 + 0.25n/4 ≈ 0.4n iteration
stages with (normally) 0.4n^2 3-2 RBAs. However, 1/3 n * 0.4n * 1/2 of these
adders can be omitted, yielding n^2/3 equivalent full adders for z. On the whole,
the suggested architecture needs 2.7n^2 equivalent full adders.

The latch count is the same as in the mixed radix case minus the savings in z, thus

2 * (2 * 2.25n^2/4 + 2 * n^2/4 + 0.4n^2 - 1/3 n * 0.4n * 1/2) = 3.9n^2 latches.
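
The rotation mode figures can be re-checked with exact rational arithmetic. The sketch below evaluates the expressions quoted above with all quantities expressed as coefficients of n^2; the variable names are hypothetical, and the z-path is taken as 0.4n stages, as in the text.

```python
from fractions import Fraction as F

r = F(9, 16)            # 2.25/4: first part of the iterations, in units of n
quarter = F(1, 4)
half = F(1, 2)

# Mixed radix 2-4 architecture: latches and equivalent full adders.
mixed_latches = 3 * 2 * r + 2 * 2 * quarter
mixed_adders = r * (2 * 2 + 1) + quarter * (2 + 1)

# Proposed architecture: x and y paths built from 4-2 RBAs (weight 2)
# and 2-2 RBAs (weight 2/3).
xy_adders = 2 * (2 * (r * half * F(3, 2) + half * quarter * half)
                 + F(2, 3) * (r * half * half + half * quarter * F(3, 2)))

# z-path: 0.4n stages of 3-2 RBAs, minus the pruned triangle.
z_stages = F(2, 5)
z_pruned = F(1, 3) * z_stages * half
z_adders = z_stages - z_pruned

total_adders = xy_adders + z_adders
latches = 2 * (2 * r + 2 * quarter + z_stages - z_pruned)

print(float(mixed_latches), float(mixed_adders))
print(float(xy_adders), float(z_adders), float(total_adders), float(latches))
```

With these inputs the adder total comes out at 65/24, about 2.71 n^2, and the latch count at 47/12, about 3.92 n^2, in line with the rounded values of 2.7n^2 and 3.9n^2.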

Fig. 8: Combinational area of proposed architecture (latches in x and y must have
full length)



5.2 Vectoring mode 



The absolute value method needs n^3/6 + 9.5n^2 latches and 5n^2 equivalent full
adders /8/. The mixed radix 2-4 architecture /6/ requires the same hardware as in
the rotation mode as it represents a unified architecture. The determination of
$\sigma_i$ is much more complicated due to the necessary prescaling of x and y. To
simplify the comparison we build upon the same approach, but do not take into
account the necessary hardware.

In our proposal, the x-iteration stops after 2.25n/4 iterations. So the x-datapath
can be implemented with 2.25n/4 * n * 1/2 4-2 RBAs and 2-2 RBAs each,
resulting in (2 * 2.25/4 * 1/2 + 2/3 * 2.25/4 * 1/2) * n^2 = 0.75n^2 equivalent full
adders.

The y-path requires 4.25n/4 iterations and on the average n/2 4-2 RBAs, yielding
about n^2 full adders. In the z-path we have 0.4n iterations with 0.2n^2 3-2 and 2-2
RBAs each. So we have n^2/3 full adders and in summary for all datapaths slightly
more than 2n^2 full adders.

The latch count is 2 * 2.5n^2 = 5n^2.
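
The vectoring mode adder counts can be checked in the same way. The sketch below again uses exact fractions as coefficients of n^2 and the area weights 2, 1 and 2/3 for the 4-2, 3-2 and 2-2 RBA; the variable names are hypothetical.

```python
from fractions import Fraction as F

W42, W32, W22 = F(2), F(1), F(2, 3)   # area weights in equivalent full adders
half = F(1, 2)

# x-path: 2.25n/4 iterations, half 4-2 RBAs and half 2-2 RBAs per row.
x_iterations = F(9, 16)
x_adders = x_iterations * half * W42 + x_iterations * half * W22

# y-path: 4.25n/4 iterations with n/2 4-2 RBAs on average.
y_iterations = F(17, 16)
y_adders = y_iterations * half * W42

# z-path: 0.4n iterations, 0.2n^2 3-2 RBAs and 0.2n^2 2-2 RBAs.
z_adders = F(1, 5) * W32 + F(1, 5) * W22

total_adders = x_adders + y_adders + z_adders
print(float(x_adders), float(y_adders), float(z_adders), float(total_adders))
```

The total of 103/48, about 2.15 n^2, agrees with the statement that slightly more than 2n^2 equivalent full adders are needed.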

6. CONCLUSION

The results are summarized in Table II.

Table II: Comparison

         /8/                         /6/                      this proposal
         adders    latches           adders    latches        adders    latches

Rot.     5n^2      7.75n^2           3.5n^2    4.375n^2       2.7n^2    3.9n^2
Vect.    5n^2      n^3/6 + 9.5n^2    3.5n^2    4.375n^2       2n^2      5n^2

As can be seen, significant savings have been achieved. To assess these results we
are developing parametrized VHDL models of these architectures. Currently, we
have modeled a rotation mode standard radix-2 redundant addition scheme with
full length datapaths and all iterations, and a radix-2 architecture with reduced
x- and y-datapaths according to section 4.1 and a z-path containing about n/3
iterations and the parallel prediction. These architectures have been verified by
extensive simulation, synthesized for an external wordlength of 16 bit, and mapped
onto a 1.0 µm CMOS standard cell design.

The overall area reduction obtained is 20%. It should be noted that this result has 
been achieved at the first run and no optimized library for the improved RBA’s 
was available. We expect further reductions by refining the model and the 
library. The improvements are valid for implementations in both correcting 
iteration and absolute value architectures.

Abstract

We present the Rainbow hardware design environment for asynchronous
micropipeline systems. Rainbow contains a suite of user-level sub-languages for
multi-view design description. An underlying formal model defines the
behaviour of Rainbow components and thus provides a means of determining
their combined behaviour. A simple processor design is used to illustrate the
integrated description style provided by Rainbow and a smaller example shows
how the components interwork at the semantic level.

Keywords 

Multi-view design, micropipelines, asynchronous systems, formal methods. 



1 INTRODUCTION 

The Micropipeline method (Sutherland 1989) for designing asynchronous
hardware systems has been used successfully for the development of
commercial-scale devices, including asynchronous versions of the ARM processor
in the AMULET Project (Paver 1994). However, the perceived benefits of
asynchronous design (the production of high-speed and low-power systems)
have so far been difficult to realise due to the relative lack of design
representations and associated tools specialised to asynchronous micropipeline
hardware. Certain design methods are equipped with description languages, such as
Philips' Tangram system (van Berkel 1992), or the synthesis approach adopted
in (Martin 1990). However, specialised design languages have not yet emerged
for micropipelines, especially for higher-level abstract descriptions. Designers
also require simulation tools for early (high-level) experimentation, and
analysis tools for detecting deadlocks, estimating performance, or for checking
design equivalence. Of course, such analysis tools must be based on proper
foundations, i.e. it is essential that the design language has a usable formal
semantics. Current formally based approaches have usually been limited to
capturing the request/acknowledge control signalling between micropipeline
components, e.g. the CCS models of the AMULET1 processor (Liu 1995). Others
have developed special process algebras for delay-insensitive systems (Josephs
& Udding 1990), or have used Petri Net models (Yakovlev et al. 1995).



1.1 The Rainbow Approach 

Our approach has been to develop Rainbow, a suite of user-oriented, multi-view
description languages specialised for modelling micropipeline designs.
A single design can be described in a variety of styles, using the language
most appropriate for each design component. The Rainbow languages have
been developed in close collaboration with the AMULET Group at the
University of Manchester, so that they will closely match the needs of practising
hardware designers. The formal semantics of each of the Rainbow languages
is defined via a translation to a common underlying process term language
called APA (Asynchronous Process Algebra), which also operates at the
micropipeline level: basic micropipeline components and combinators are
modelled by atomic APA components and combinators. The semantics of APA
is defined via SOS-style transition rules (Hennessy 1990). This provides the
foundation for the development of formal analysis tools. The general approach
adopted to the Rainbow semantics follows the successful line we took with
developing formal analysis tools for the (synchronous) hardware description
language ELLA (Barringer et al. 1996b).

In section 2, we describe Rainbow and the component sub-languages. An 
example of the design development style now possible is illustrated with a 
description of a simple processor (section 2.3). We then present an outline 
of the APA semantics for Rainbow (section 3) and show how this is used to 
provide interoperability between the different sub-languages (section 4). 



2 RAINBOW 

The Rainbow framework currently includes a dataflow-style language (Green), 
based on micropipeline communication and primitives, for hierarchical struc- 
tural descriptions, with schematic and textual versions. It also includes a 
control-flow style language (Yellow) for algorithmic descriptions, using Ada-like
rendezvous communication (Li 1982, Sommerville & Morrison 1987) between
components. We are extending the framework to include a high-level
language (Red) for behavioural/specification descriptions using, for example, 
temporal logic or stream transformers, and a Blue language to operate at a 
level below Green, by exposing the handshaking control, similar to the CCS 
models of AMULET in (Liu 1995).



2.1 Green

Green is a static dataflow language (Herath et al. 1992), using explicit finite
buffering between elements*, that models micropipelines. Buffers are used to 
introduce state and provide decoupling between inputs and outputs — an 
empty buffer can accept an input; once this has been acknowledged then the 
(full) buffer can output the value, and is only ready to accept a new input 
when the output has been released. A stateless element can only fire when all 
of the required inputs are present. It then generates output values, but does 
not release any input channel until all its outputs have been accepted and 
released. For example, a Duplicate node simply copies its input to a number 
of outputs; sources (or sinks) generate (or absorb) streams of values. Dataflow 
control elements provide ways to conditionally merge and split data streams. A 
more generalised element is the Table, which combines functionality and flow 
control. Elements can be combined serially (as a pipeline) or in parallel. Refer 
to (Barringer et al. 1996a) for more details of Green and its editing/simulation 
tools. 

As an example of Green, Figure 1 shows part of the simulation of a schematic
computing sequences of Fibonacci numbers (an APA translation will be given
in section 3). Buffer C has just received the next Fibonacci number (8 = 3+5),
buffer B is supplying the old Fibonacci number (3), and buffer A supplies
the current number (5). The channel labels with a light background indicate
values that have successfully been transferred, and the labels with a dark
background show values that are ready to be transferred but are currently
blocked. Therefore, only one of the values from the Dup element has so far
been consumed, with the second value 5 waiting to be written into buffer
B. When this has occurred, then buffer A will be ready to receive the new
Fibonacci number (8) from C.



Figure 1 Fibonacci Calculator Design
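
The behaviour traced in Figure 1 can be mimicked in a few lines of Python. This is only an illustrative sketch and not part of the Rainbow tools: the names `Buffer` and `fibonacci_steps` are hypothetical, and the request/acknowledge handshake is collapsed into sequential steps rather than modelled as concurrent events.

```python
class Buffer:
    """One-place buffer: empty (None) or full (holding exactly one value)."""

    def __init__(self, value=None):
        self.value = value

    def put(self, value):
        # An input is only accepted while the buffer is empty.
        assert self.value is None, "buffer is full"
        self.value = value

    def take(self):
        # Releasing the output empties the buffer again.
        assert self.value is not None, "buffer is empty"
        value, self.value = self.value, None
        return value


def fibonacci_steps(a, b, count):
    """Run the A/B/C loop of the Fibonacci network for `count` rounds."""
    buf_a, buf_b, buf_c = Buffer(a), Buffer(b), Buffer()
    produced = []
    for _ in range(count):
        current = buf_a.take()      # Dup offers A's value on two outputs
        old = buf_b.take()          # B supplies the old number
        buf_c.put(current + old)    # Add fires once both inputs are present
        buf_b.put(current)          # second Dup copy becomes the old number
        new = buf_c.take()
        produced.append(new)
        buf_a.put(new)              # A accepts the new number from C
    return produced


print(fibonacci_steps(5, 3, 4))  # [8, 13, 21, 34]
```

Starting from the state shown in the figure (A = 5, B = 3), the first round produces 8 in buffer C, as in the text.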



*Note that wires are delay-less.



2.2 Yellow

In contrast to the static structural description style offered by Green, Yellow 
allows evolving descriptions to be constructed, using a control-flow style, sim- 
ilar in many respects to CSP (Hoare 1985) or Ada. The (explicit) call/accept 
communication primitives used are adaptations of the Ada rendezvous. An 
input statement ‘accept inps do P end’ starts execution of P when all of 
the inputs inps are ready. When P terminates then the values on inps are ac- 
knowledged, and the processes supplying the inputs are then freed to continue. 
An output statement ‘call outs’ makes the outputs outs available, and then 
is suspended until all of the outputs have been consumed by matching inputs 
in accept statements. Yellow also includes the standard imperative sequence 
(P ; Q) and assignment operators (r := e). A choice construct provides the 
means for switching control conditionally between different code fragments. 
A loop construct ‘loop grd -> P end’ repeatedly executes P while grd is true;
otherwise the construct is exited. Other Rainbow components, such as Green
procedures, can be instantiated in Yellow descriptions. 



2.3 Example: SMPU — a Simple Processor 

A simple processor design example, SMPU, uses interacting Green and Yellow 
components and illustrates the mixed-view style of design description sup- 
ported by Rainbow. Figure 2 shows the top-level block structure described in 
Visual Green, a style similar to that used in informal descriptions given in 
design documentation, such as for AMULET1 (Paver 1994).




Figure 2 Visual Green SMPU Figure 3 Registers 



The functionality of each block is then described in the most appropriate style.
For example, decode provides the basic control of SMPU and sequences the
fetch-decode/execute cycle for each instruction; this can therefore be described
initially using Yellow. First, it outputs controls to the SMPU components to
fetch an instruction from memory. When the instruction is returned, decode
stores it in register ‘i’, and then proceeds with decode and execution:



yellow decode (input instruction: instr;
               output memIFC: memctrl; output regIFC: regctrl;
               output aluIFC: aluctrl) = {
  loop TRUE -> {
    reg i: instr end
    // -- Fetch (pc is at register 0):
    par call memIFC!is_instr
     || { call regIFC!(ARG_A,0) ; call regIFC!(WB,0) }
     || call aluIFC!FETCHING
     || accept instruction?ins do i:=ins end
    end_par ;
    // Execute: <code omitted>
  } end;
} end



The ALU is described using the Table construct of Green, which offers a 
concise functional description. 



table alu_block (input aluIFC: aluctrl; input a,b: val;
                 output mrd: memread;
                 output mwr: memwrite;
                 output writeback: val) = {
     FETCHING, a, -  =>  a,   -,     a+1     // a is pc value
  or FROMMEM,  a, b  =>  a+b, -,     -       // a is base addr, b is offset
  or TOMEM,    a, b  =>  -,   (a,b), -       // a is addr, b is value
  // -- Binary ALU operations
  or ADD,      a, b  =>  -,   -,     a+b
  or SUB,      a, b  =>  -,   -,     a-b
  or AND,      a, b  =>  -,   -,     a/\b
  or CMP,      a, b  =>  -,   -,     a>b
  // -- Unary ALU operations
  or MOV,      a, -  =>  -,   -,     a
  or NOT,      a, -  =>  -,   -,     not a
} end
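
As an informal reading aid, the table can be viewed as a pattern-matching function: a row fires when the control literal matches and every input that the row names is present. The Python sketch below makes that reading concrete. It follows the rows only as far as they can be read from the listing above; `alu_step`, `ALU_ROWS` and the use of `None` for an empty entry are illustrative choices, and `&` and `~` are assumed integer stand-ins for the table's `/\` and `not`.

```python
EMPTY = None  # stands for '-': input not consumed / output not produced

# Each row: (control literal, whether input b is needed,
#            function returning the triple (mrd, mwr, writeback)).
ALU_ROWS = [
    ("FETCHING", False, lambda a, b: (a, EMPTY, a + 1)),
    ("FROMMEM",  True,  lambda a, b: (a + b, EMPTY, EMPTY)),
    ("TOMEM",    True,  lambda a, b: (EMPTY, (a, b), EMPTY)),
    ("ADD",      True,  lambda a, b: (EMPTY, EMPTY, a + b)),
    ("SUB",      True,  lambda a, b: (EMPTY, EMPTY, a - b)),
    ("AND",      True,  lambda a, b: (EMPTY, EMPTY, a & b)),
    ("CMP",      True,  lambda a, b: (EMPTY, EMPTY, a > b)),
    ("MOV",      False, lambda a, b: (EMPTY, EMPTY, a)),
    ("NOT",      False, lambda a, b: (EMPTY, EMPTY, ~a)),
]


def alu_step(ctrl, a, b=EMPTY):
    """Fire the row matching `ctrl`; return (mrd, mwr, writeback) or None."""
    for literal, needs_b, outputs in ALU_ROWS:
        if ctrl != literal:
            continue
        if needs_b and b is EMPTY:
            return None  # a required input is missing, so the row cannot fire
        return outputs(a, b)
    return None  # no row matches this control value


print(alu_step("FETCHING", 7))   # (7, None, 8)
print(alu_step("TOMEM", 4, 9))   # (None, (4, 9), None)
```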



A row in the table is executed when the values on its input can be matched 
with the input patterns (to the left of the ‘=>’), causing the expressions (to the
right of the ‘=>’) to be output. The entries may contain literals, e.g. FETCHING, 
or variables, e.g. a, which must match the input value. An ‘empty’ entry 
means that an input is not consumed or an output is not produced. For other 
components, such as the register bank, more detailed (structural) descriptions 
may be given using Green sub-networks, as shown in Figure 3.



3 APA

The interfaces between the various components described in Green and Yellow
must be well-defined; this is achieved by translating the sub-language
components into a common underlying process term language called APA. This has
some similarities to more familiar process algebras, such as CCS (Milner 1989), 
CSP (Hoare 1985) or LOTOS (Bolognesi & Brinksma 1987), having the usual 
types of process operators and an operational semantics. However, our compo- 
sition operators are designed to support bundled-data micropipeline commu- 
nication, and their semantics resembles that given in (Li 1982). APA supports 
value-passing and has richly-structured actions, which leads to a compact se- 
mantic representation. It has the following syntax: 



D ::= pnm(fargs) ≙ P            Proc defn

P ::= skip | fail               Atomic proc
    | sync{ins}{outs}{P1}       Synchronise
    | grd P1                    Guard
    | P1 ; P2                   Sequence
    | P1 | P2                   Parallel
    | P1 + P2                   Choice
    | P1\{ch}                   Restrict
    | (ν x : S • P1)            Var bind
    | pnm(args)                 Instance



The semantics for APA, defined via SOS-style rules, determines the possible 
transitions of each term. A transition is labelled by an action containing sets 
of input/output channel bindings and a flag indicating whether the process
has terminated. For example, the action ‘α = (F, (i1?5, c?true / o!5))’ is from
a non-terminating process, with values 5 and true being input on channels i1
and c respectively, and the simultaneous output on channel o of the value 5.

The familiar process algebra combinators used in APA have their usual meaning.
For example, the sequence combinator executes P1 and then P2. The
parallel operator evolves both component processes independently, combining
any resulting actions with an action product operator. When either process
terminates, indicated by an action with channel bindings α, execution
simply continues with the active process. However, since APA uses
richly-structured actions, the action product is more complex; if the values
bound to same-name output and input channels match, then references to 
that channel are removed from the action label, otherwise the transition fails. 
Any unpaired channel is left in the label unchanged. 
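
The matching performed by the action product can be illustrated with a small Python sketch. It is a simplification made for illustration, not the APA definition itself: an action is represented only by its input and output bindings (the termination flag is omitted), the function name `action_product` is hypothetical, and channel names are assumed to be distinct within each side.

```python
def action_product(left, right):
    """Combine two actions, each a pair (inputs, outputs) of channel->value dicts.

    Returns the combined (inputs, outputs), or None when the transition fails.
    """
    l_in, l_out = left
    r_in, r_out = right
    inputs = {**l_in, **r_in}
    outputs = {**l_out, **r_out}
    # Channels on which one side outputs and the other side inputs.
    paired = (l_out.keys() & r_in.keys()) | (r_out.keys() & l_in.keys())
    for chan in paired:
        if inputs[chan] != outputs[chan]:
            return None       # same-name channels carry different values: fail
        del inputs[chan]      # a matched pair disappears from the label
        del outputs[chan]
    return inputs, outputs    # unpaired channels are left unchanged


producer = ({"i": 5}, {"m": 5})   # reads 5 on i, writes 5 on m
consumer = ({"m": 5}, {"o": 6})   # reads 5 on m, writes 6 on o
print(action_product(producer, consumer))  # ({'i': 5}, {'o': 6})
```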

Of particular interest is the synchronisation construct ‘sync{ins}{outs}{P1}’.
Instead of a prefix action being performed and then followed by a separate
continuation process, sync ensures that input action synchronisation is not
completed until both the outputs have been consumed and the body has
completed. ‘Blocked’ actions are introduced, written as π(α) for an action
α: when all of the inputs ins are available then the outputs outs and body
process P can begin execution (rule Sync-1), utilising any of the input values;
meanwhile the inputs are ‘blocked’. Having completed execution of P and
outs, then the inputs are consumed (Sync-2):



Sync-1 



for some 6' C 6 b" =b — b' P - > P' 



sync{a}{6}{P} ^ sync{a}{6"}{P'} 



Sync-2 



pjh^p. 



sync{a}{6}{P} ° * (T. c) ^ ^ 

where b'' is the difference between the two channel maps b, b'. Therefore sync
can be used to model micropipeline communication directly, without resorting 
to separate control channels. 



3.1 APA for Green 

Green constructs include buffers, duplicators, merge/split and functions. These 
can be modelled as APA processes, for which we can derive transition rules. 
For example, the behaviour of a buffer B[ ] is given by the two rules*:

\[
\text{Buf-1}\quad B[x] \xrightarrow{\;o!x\;} B[-]
\qquad\qquad
\text{Buf-2}\quad \frac{\text{for some } x \in S}{B[-] \xrightarrow{\;i?x\;} B[x]}
\]

Rule Buf-1 shows that a full buffer B[x] storing some value x can transform
into an empty buffer B[-] via a transition with label o!x. Similarly, Rule Buf-2
shows that an empty buffer B[-] can transform into a full buffer B[x] by
accepting some value x from set S on an input channel i, indicated by the
transition label i?x. The derived rules follow directly from the APA transition
rules for skip, sync{_}{_}{_} and sequence.
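
Read operationally, the two buffer rules describe a two-state machine. The following Python sketch enumerates the transitions that the rules allow from a given state; the function name `buffer_transitions` and the encoding of the empty buffer as `None` are assumptions made for illustration only.

```python
def buffer_transitions(state, values):
    """Enumerate (label, next_state) pairs for a one-place buffer.

    `state` is None for the empty buffer B[-], or the stored value x for B[x];
    `values` plays the role of the value set S.
    """
    if state is None:
        # Buf-2: an empty buffer accepts any x in S on input channel i.
        return [(f"i?{x}", x) for x in values]
    # Buf-1: a full buffer offers its value on output channel o and empties.
    return [(f"o!{state}", None)]


print(buffer_transitions(None, [0, 1]))  # [('i?0', 0), ('i?1', 1)]
print(buffer_transitions(3, [0, 1]))     # [('o!3', None)]
```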

Green split and merge constructs provide demultiplexing and multiplexing
operations. For example, Split directs the value on its input to either output
o1 or to o2 depending on the control value supplied in channel c:

\[
\text{Sp-1}\quad \frac{\text{for some } x \in S}{Split \xrightarrow{\;i?x,\ c?true\ /\ o_1!x\;} Split}
\qquad\qquad
\text{Sp-2}\quad \frac{\text{for some } x \in S}{Split \xrightarrow{\;i?x,\ c?false\ /\ o_2!x\;} Split}
\]



* Because Green processes never terminate, there is no need to show the termination
flag on each transition label, since it is always ‘F’.

The Green stream-copying operator makes duplicate values of its input available
on independent outputs. It does not acknowledge the input until both of
the output copies have been consumed:



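\[
\text{D-1}\quad \frac{\text{for some } x \in S}{Dup\langle\rangle \xrightarrow{\;i?x\ /\ o_1!x,\ o_2!x\;} Dup\langle\rangle}
\]

\[
\text{D-2a}\quad \frac{\text{for some } x \in S}{Dup\langle\rangle \xrightarrow{\;\pi(i?x)\ /\ o_1!x\;} Dup\langle o_2!x\rangle}
\qquad\qquad
\text{D-3a}\quad Dup\langle o_2!x\rangle \xrightarrow{\;i?x\ /\ o_2!x\;} Dup\langle\rangle
\]

with corresponding rules D-2b, D-3b for when output o2 is consumed first.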
The Green combinators for parallel and pipe are modelled using APA parallel, 
restriction and renaming in the usual way. 

For example, the APA term that represents the Green description of the
Fibonacci calculator (Figure 1) is:

B[5](i←ch) » Dup » (Id() | B[3]) » Add » B[-](o←ch)



3.2 APA for Yellow 

In a similar way, we model Yellow primitives as APA processes, and derive 
transition rules describing their behaviour. Many Yellow primitives and com- 
binators have direct correspondence with terms defined in APA. 

The input channel communication statement ‘accept {ins} do P end’ translates
into the APA term ‘(ν x_ins : S • sync{ins}{}{P})’, where x_ins are
the free variables in the input channel bindings ins. The output statement
‘call outs’ translates into the APA term ‘sync{}{outs}{}’. Yellow sequence
and parallel are also supported directly in APA.

At the pure APA level, registers in Yellow are modelled by processes. However,
for clarity in the derived rules for Yellow, we introduce an evaluation
environment σ to store bindings, used during evaluation of any expression e via
a meaning function M⟦e⟧σ. Assignment to registers updates the evaluation
environment σ, as shown in the following derived rule:

\[
\text{Assign}\quad \frac{\mathcal{M}[\![e]\!]\sigma = v}{\langle r := e,\ \sigma\rangle \xrightarrow{\;(T,\,\emptyset)\;} \langle 0,\ \sigma \dagger \{r \mapsto v\}\rangle}
\]



For convenience, a loop operator ‘loop grd -> P end’ is introduced, which
executes P repeatedly when grd evaluates to true, and exits otherwise. The
following are derived rules for the loop construct:



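\[
\frac{\mathcal{M}[\![grd]\!]\sigma = true \qquad \langle P,\ \sigma\rangle \xrightarrow{\;\alpha\;} \langle P',\ \sigma'\rangle}{\langle \text{loop } grd \to P \text{ end},\ \sigma\rangle \xrightarrow{\;\alpha\;} \langle P' ;\ \text{loop } grd \to P \text{ end},\ \sigma'\rangle}
\]

\[
\frac{\mathcal{M}[\![grd]\!]\sigma = false}{\langle \text{loop } grd \to P \text{ end},\ \sigma\rangle \xrightarrow{\;(T,\,\emptyset)\;} \langle 0,\ \sigma\rangle}
\]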



APA thus provides a single uniform semantics for current Rainbow sub- 
languages, thereby supporting interworking — suitable translations from Rain- 
bow to APA are defined. The task of translation is made easier by introduc- 
ing APA processes and corresponding derived transition rules, to model many 
Rainbow components directly. 
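
The effect of the derived rules for assignment and loop in section 3.2 on the evaluation environment can be illustrated in Python. The sketch below is a deliberate simplification: it evaluates a whole loop in one go instead of taking individual labelled transitions, it ignores channel actions altogether, and the names `assign` and `run_loop` are hypothetical.

```python
def assign(reg, expr):
    """Model 'r := e' as a function from an environment to an updated copy."""
    def step(env):
        # Evaluate e in the current environment, then override the binding of r.
        return {**env, reg: expr(env)}
    return step


def run_loop(guard, body, env):
    """Model 'loop grd -> P end' for a body without communication."""
    while guard(env):      # guard true: execute P, then consider the loop again
        env = body(env)
    return env             # guard false: exit, environment unchanged


countdown = run_loop(
    lambda env: env["count"] > 0,
    assign("count", lambda env: env["count"] - 1),
    {"count": 3},
)
print(countdown)  # {'count': 0}
```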



4 INTEROPERABILITY IN RAINBOW 



A simple block adder example is used to illustrate interoperability of Yellow
and Green sub-designs; the top-level is given by the Green network in Figure 4.
This contains sequential control and structural dataflow components, just as
the SMPU in section 2.3 does. Yellow and Green are used respectively to
describe the components (Figures 5 and 6).



Figure 4 Top-Level Green Block Adder Unit

The block adder unit BAdder first reads an integer value x from the input 
channel size, and then adds the next x values presented on channel num 
before outputting the cumulative sum to channel sum. The Yellow control 
reads size and then sends true to indicate that adder is to output the sum 
and prepare to start a new summation, or false to indicate that adder is to 
read the next input and add it to the sum accumulated so far. 

We now show how the Yellow and Green fragments are translated into APA
and the APA transition rules are used to determine possible behaviours. For
the Yellow control process, statements are executed in sequence, so that the
control-flow can be followed through the code. For example, the initial
behaviour of control is for it to output ctrl!true, reaching Loop1. Figure 7
shows part of the semantics for control as a transition diagram. There are
three states control, Loop1 and Loop2, with the starting state circled. Each
transition is labelled by the possible input/output actions, together with
conditions for execution and any register update that occurs in the transition.
For example, when count > 0 there is a transition from Loop2 back to itself,
with output ctrl!false and the register being decremented in the next-state
(i.e. count' is the next value of register count).



yellow control (input size: integer;
                output ctrl: boolean) = {
  reg count: integer end
  call ctrl!true;
  loop true ->                      // -- Loop1
    {accept size?x do count:=x end;
     loop (count > 0) ->            // -- Loop2 start
       {call ctrl!false;
        count := count - 1}
     end;
     call ctrl!true}
  end
} end

Figure 5 Yellow control

Figure 6 Green adder
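
The sequence of values that the Yellow control process of Figure 5 sends on its output channel can be traced with a short Python generator. This is an illustrative sketch only: `control_outputs` is a hypothetical name, the rendezvous with the adder is not modelled, and the inputs on channel size are supplied as a plain list.

```python
def control_outputs(sizes):
    """Yield the values sent on ctrl for a sequence of values read from size."""
    yield True                 # initial 'call ctrl!true', reaching Loop1
    for count in sizes:        # Loop1: 'accept size?x do count := x end'
        while count > 0:       # Loop2: taken while count > 0
            yield False        # 'call ctrl!false'
            count -= 1         # 'count := count - 1'
        yield True             # 'call ctrl!true', back to Loop1


print(list(control_outputs([2, 0, 1])))
# [True, False, False, True, True, False, True]
```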



Figure 7 Transitions for Control

Figure 8 Transitions for Adder



In contrast, the Green dataflow description of the adder does not use the same 
notion of execution, in that there is no locus of control for code execution. 
The activity of each component is determined by its state and the various 
actions that can be accepted from its inputs or written to its outputs. The 
behaviour of the complete network is taken as the possible legal combinations 
of component activity. 

For example, let the Green adder network be represented by ‘adder(<a,b>)’,
when the two buffers A, B have the values a, b. If the network is in state
‘adder(<-,0>)’ and it accepts ctrl?true, num?x on its inputs, then one
resulting behaviour is for sum!0 to be output, with the new state adder(<x,->)
due to the following component transitions:




\[
\text{Par}\quad
\frac{
\begin{array}{c}
Dup\langle\rangle \xrightarrow{\;ctrl?true\ /\ c1!true,\ c2!true,\ c3!true\;} Dup\langle\rangle \\[4pt]
Split \xrightarrow{\;num?x,\ c2?true\ /\ s1!x\;} Split \\[4pt]
Merge \xrightarrow{\;s1?x,\ c1?true\ /\ save!x\;} Merge \\[4pt]
Split \xrightarrow{\;tot?0,\ c3?true\ /\ sum!0\;} Split \\[4pt]
B[-] \xrightarrow{\;save?x\;} B[x] \qquad B[0] \xrightarrow{\;tot!0\;} B[-]
\end{array}
}{
adder(\langle -,0\rangle) \xrightarrow{\;ctrl?true,\ num?x\ /\ sum!0\;} adder(\langle x,-\rangle)
}
\]



Figure 9 Transitions for Combined Machine



Figure 8 shows a (partial) transition diagram for the Green adder. 

In order to combine the Yellow and Green components in the Add unit, the
control output ctrl from the Yellow control, together with input num, are
piped into the Green adder as shown in Figure 4. In this case, the Pipe
transition rule determines how the Yellow and Green processes interact; for
example, the initial move of the system may be calculated as follows:



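\[
\text{Pipe}\quad
\frac{
\begin{array}{c}
\text{(a)}\quad (control \mid num) \xrightarrow{\;ctrl!true,\ num!x\;} (Loop1 \mid num) \\[4pt]
\text{(b)}\quad adder(\langle -,0\rangle) \xrightarrow{\;ctrl?true,\ num?x\ /\ sum!0\;} adder(\langle x,-\rangle)
\end{array}
}{
(control \mid num) \gg adder(\langle -,0\rangle) \xrightarrow{\;/\ sum!0\;} (Loop1 \mid num) \gg adder(\langle x,-\rangle)
}
\]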



Figure 9 shows the ‘product’ transition diagram for the complete system, 
calculated from Figures 7 and 8 using the Pipe transition rule. 



5 SUMMARY 

We have described the Rainbow hardware design environment that supports 
the development of user-level, mixed-view descriptions of asynchronous mi- 
cropipeline systems. An underlying formal semantics provides the means for 
defining the interworking between design components described in the different
Rainbow sub-languages. Prototype design and simulation tools for Rainbow
have been developed, and the formal semantic foundation gives the basis
for development of formal analysis tools.

This work was supported by the UK Engineering and Physical Sciences Research
Council via research grant GR/K42073.



Abstract

A methodology that efficiently translates Estelle formal specifications into a
VHDL description, suitable for High Level Synthesis (HLS) of communication
protocols, is proposed. The effect of the protocol description style in VHDL on
the result of the HLS scheduling step is discussed with respect to the Dynamic
Loop Scheduling



1 INTRODUCTION

The increasing performance of computer networks relies basically on the
increasing speed and reliability and the decreasing cost of VLSI circuits, and on
the development of optical fibers as a transmission medium, which have enabled
the emergence of new applications in multimedia.

Despite the progress in VLSI circuit design methodologies and fabrication
technologies, the main limiting factor in the performance of a computer network
is still the amount of processing needed to run protocols in workstations and
network servers. These technological issues require modifications in protocol
design and implementation. High performance in communication subsystems can
be obtained by software optimization, by the use of parallel structures in the
construction of protocols and by hardware implementation of these protocols in a
VLSI circuit.

In this last case, the protocol designer should be concerned with the time required
to design a VLSI circuit as compared with its expected commercial lifetime.

Integrated circuits are currently designed in a semi-automatic way by using a set
of design tools generally integrated into a design environment. The design process
generally begins with a high level system specification, where this term refers to a
system level behavioral hardware description. In the case of communication
protocols, however, the design process involves system specifications
written at a level of abstraction even higher than that commonly used for digital
systems, as is the case of the Formal Description Techniques (FDTs) such as the
ISO Estelle language (Budkowski, 1987). As a consequence, the designer of high
speed protocols will face the problem of interacting with two design
environments: one for the protocol specification, verification and testing and the
other for the synthesis, simulation and physical implementation of a circuit. In
general, these two environments are based on different description languages,
requiring an interface to translate one language into the other. Several efforts have
been made to translate a system level specification into a Hardware Description
Language (HDL), such as the work of Kloos (Kloos, 1993) on the translation of
Lotos into VHDL and those of Pirmez (Pirmez, 1995) and Wytrebowicz
(Wytrebowicz, 1995) on the Estelle-VHDL mapping.

The High-Level Synthesis of protocol controllers requires the use of specific
HLS tools called Control Flow Dominated Behavioral Compilers (Walter, 1991).
These tools are able to handle large control structures that include sophisticated
handshaking and non-structured sequences. However, most of these compilers
have a drawback: the compilation results depend on the quality of the input
description, because a specification in VHDL (Ashenden, 1990) may include
explicit synchronization points (e.g. a wait in a VHDL description) which restrict
the scope of the transformations and optimizations usually performed on behavioral
descriptions (Bhasker, 1990). This paper presents a methodology to efficiently
translate Estelle specifications into VHDL descriptions to be used by synthesis
tools. The effect of the protocol description style in VHDL (Pirmez, 1996) on the
result of the HLS scheduling step is discussed with respect to a typical scheduling
algorithm, the Dynamic Loop Scheduling (DLS) (Rahmouni, 1994), implemented
in the HLS tool AMICAL, a VHDL behavioral compiler for control flow
dominated machines (Park, 1993). In order to test the proposed methodology, a
high speed protocol based on the ISO reference protocol ABRACADABRA, called
ABRACADABRA_HS, was devised. Section 2 describes a methodology to
translate Estelle specifications into VHDL descriptions suitable for synthesis tools.
The different styles that can be used to describe a protocol in VHDL and their
HLS results are discussed in Section 3. The last section deals with the results
obtained so far.



2 TRANSLATING ESTELLE* INTO VHDL

The main steps involved in the proposed methodology of HLS of communication
protocols are shown in Figure 1.

Figure 1 Protocol design environment

The starting point is a correctly compiled and simulated Estelle specification,
obtained in a suitable protocol design environment, that is mapped into a VHDL
description which is used by the circuit design environment. The following
discussion concerns the translation of this specification into a VHDL description.
In particular, we are interested in criteria to guide the protocol designer among
the many different possible solutions, leading to an HLS-efficient Estelle-VHDL
translation.



2.1 The Translation Strategy

An Estelle specification is composed of a set of modules hierarchically 
organized and a VHDL description is composed of a set of entities described 
separately. Thus, each Estelle module maps naturally into a VHDL entity, as 
shown in Figure 2.(1).




Figure 2 Mapping Estelle into VHDL 



The hierarchy among Estelle modules is maintained through a VHDL 
component declaration statement, as shown in Figure 2.(3). 

Each VHDL component statement declares an entity, and its description is placed
in design libraries. The VHDL component instantiation construct is used for
creating VHDL entity instances (Estelle init function) whose ports are connected
either to actual signals or to the entity ports (Estelle connect and attach
functions), as shown in Figure 2.(4). The binding of a VHDL entity to its instance
(component instantiation) is achieved through a configuration declaration, as
shown in Figure 2.(5).

Each Estelle module is composed of two parts: the header (module) and one or 
more bodies. This corresponds to the VHDL entity-architecture pair, with each 
Estelle header (module) corresponding to a VHDL entity and each Estelle body 
corresponding to a VHDL architecture, as shown in Figure 2.(1) and 2.(2). The
main Estelle module (specification) does not have an external view and,
consequently, its corresponding interface is empty.

Since VHDL has no mechanisms equivalent to the Estelle synchronous/asynchronous
types of parallelism among module instances, an efficient Estelle-VHDL
translation requires the restrictions included in Estelle* by Courtiat
(Courtiat, 1988). This allows the parallelism among activity type module
instances within a system activity type specification to be represented in
asynchronous form. As a consequence, only the semantics of an Estelle activity
module is implemented in VHDL, by creating an entity called
HIERARCHY_QUEUE_ENTITY. This entity will forward an interaction to its
queueing entity.

The Estelle language has a built-in FIFO queue mechanism that has no
equivalent in VHDL. The solution adopted and implemented to overcome this
limitation was to create an entity in VHDL, called QUEUE_ENTITY, to reproduce
only the queueing management mechanism of an Estelle module. Since there is
one FIFO queue for each interaction in an Estelle specification, it will be necessary
to allocate VHDL signals to provide the communication among entities.
Furthermore, a port called next is added to the entity that implements a module and
to the queue management entity of this module. This port enables the entity that
implements a module to fetch the next message from its corresponding
QUEUE_ENTITY.
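
The role played by QUEUE_ENTITY and its next port can be illustrated with a small behavioural model in Python. This is a sketch under stated assumptions rather than the VHDL entity itself: the class name `QueueEntity`, the method names and the `NO_MESSAGE` marker are hypothetical, and a single FIFO per module is assumed.

```python
from collections import deque

NO_MESSAGE = None  # hypothetical marker returned when nothing is pending


class QueueEntity:
    """Behavioural model of the FIFO placed in front of one module."""

    def __init__(self):
        self._pending = deque()

    def receive(self, interaction):
        # Interactions arriving on the input signals are queued in arrival order.
        self._pending.append(interaction)

    def next(self):
        # Models a request on the `next` port: hand over the oldest interaction,
        # or report that no message is available.
        if not self._pending:
            return NO_MESSAGE
        return self._pending.popleft()


queue = QueueEntity()
queue.receive("CONNECT")
queue.receive("DATA")
print(queue.next(), queue.next(), queue.next())  # CONNECT DATA None
```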

In the header of an Estelle module, each interaction contained in the set of
interactions of each interaction point corresponds to a port
(signal) of a VHDL entity, as shown in Figure 2.(6). The signal type will be input
(output) if the interaction point role is reception (transmission).
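
The header-to-entity mapping described above can be sketched as a small text generator. The Python function below is purely illustrative and is not the translator used by the authors: `vhdl_entity`, the port type `msg_type` and the example names `sender`, `data_req` and `data_ind` are all hypothetical.

```python
def vhdl_entity(name, interactions):
    """Render a VHDL entity declaration for an Estelle module header.

    `interactions` maps an interaction name to the role of its interaction
    point, either "reception" or "transmission".
    """
    modes = {"reception": "in", "transmission": "out"}
    ports = [f"    {port} : {modes[role]} msg_type"
             for port, role in interactions.items()]
    lines = [f"entity {name} is"]
    if ports:  # the main (specification) module has an empty interface
        lines += ["  port (", ";\n".join(ports), "  );"]
    lines.append(f"end {name};")
    return "\n".join(lines)


print(vhdl_entity("sender", {"data_req": "reception",
                             "data_ind": "transmission"}))
```

For the hypothetical module `sender`, the generator emits one `in` port for the received interaction and one `out` port for the transmitted one; an empty dictionary yields an entity without a port clause, matching the remark about the main Estelle module.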

An Estelle body is composed of three parts: declaration, initialization and
transition part. The declarative part contains Pascal-like statements and the
declarations of the Estelle objects. The Pascal-like statements in Estelle may be
converted into similar VHDL declarations with some restrictions: variables with
tails variables are not allowed in VHDL; and the variables in the declarative part of
an Estelle body should be declared in the declarative part of a VHDL process
statement.

With respect to the Estelle object declarations, the following procedure applies:

• An Estelle module translates into a VHDL entity;

• To specify the behavior of each Estelle module, defined in terms of a state
machine, a signal is created to store the current state of the
corresponding VHDL state machine;

• The interactions associated with each internal interaction point of each Estelle
module type correspond to the internal signals of a VHDL entity.



The execution of the initialization part of a module is performed sequentially and
only when the given module instance is initialized. The variable initialization
statement, "var := x", is similar in both languages. In Estelle, the state initialization
instruction "to state", the module instance creation "init" and the creation of
bindings between interaction points "attach" and "connect" are directly converted
into a signal assignment "state <= value", a component instantiation and a signal
assignment instruction, respectively.

The behavior of a module is described by the initialization and the transition
parts of the body. The execution of both parts is sequential. The translation of
Estelle clauses into VHDL structures is straightforward and the set of structures
obtained in VHDL is included in a process statement. The Estelle clauses "from
state", "when interaction-point interaction", "provided expression", "to state" and
"output interaction-point interaction(msg)" are directly converted into "when state
=> ..." of a case structure, "wait until interaction_condition", "if expression then ...",
"state := state value" and "interaction <= msg" in VHDL. The Estelle
instructions "procedure name (formal parameters queue)" and "function name
(formal parameters queue): return" are directly converted into the equivalent
VHDL declarations.

Only the Estelle delay clauses "delay(T)" or "delay(T,T)" will be translated into
VHDL. An entity called CLOCK_ENTITY_TYPE is created in VHDL as a time
counter with the following ports: Count, Period and Finished. The time counting
starts when the CLOCK_ENTITY_TYPE entity receives a Count signal and the
QUEUE_ENTITY of a given entity returns a no message signal when this last
entity asks for a next message. If the counting value is bigger than the value
buffered in Period, the CLOCK_ENTITY_TYPE entity sends a time-out signal,
Finished, to the entity that executes the state machine. When the entity that
executes the state machine receives the time-out signal, the time-out treatment
instructions are executed.
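
The behaviour of the timer can be illustrated with a cycle-level model in Python. The sketch is an assumption-laden simplification: the class name `ClockEntity` and its methods are hypothetical, one call to `tick` stands for one clock cycle, and `count` is assumed to be called only when both start conditions in the text hold (Count received and no message pending).

```python
class ClockEntity:
    """Behavioural model of a time counter with Count, Period and Finished."""

    def __init__(self, period):
        self.period = period      # value buffered in Period
        self.ticks = 0
        self.counting = False
        self.finished = False     # models the Finished (time-out) signal

    def count(self):
        """Count signal: start counting from zero."""
        self.ticks = 0
        self.counting = True
        self.finished = False

    def tick(self):
        """Advance one cycle; raise Finished once the count exceeds Period."""
        if self.counting:
            self.ticks += 1
            if self.ticks > self.period:
                self.finished = True
                self.counting = False
        return self.finished


clock = ClockEntity(period=2)
clock.count()
print([clock.tick() for _ in range(3)])  # [False, False, True]
```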

The mapping strategy may be divided into three steps. Initially, all module types
in an Estelle specification and their hierarchical relations are identified. Then, each
module type is implemented as an independent VHDL entity. The module
hierarchy is established by means of a VHDL component construct. Finally, queue
management entities are created, one for each entity and one for each group of
child entities.

To structure each VHDL entity the following procedure is proposed:

To an FSM described in the body of an Estelle module, there will correspond a
state machine described in VHDL. However, since VHDL does not have a built-in
concept of state, it is necessary to build the VHDL state machine explicitly. To this
end a VHDL entity is created with an associated architecture containing a
behavioral description of the FSM, obtained directly from the Estelle specification,
as follows:

1. The set of different states of an Estelle module will form a VHDL case 
structure. 

2. For each state, transitions associated with this state are grouped by interaction.
The "wait until condition_interaction" instruction suspends execution until
the clause condition_interaction evaluates to true. If there is more than one
condition_interaction to be evaluated in the same state, an "if" VHDL construct
is placed just after the "wait until condition_interaction" statement.

3. To avoid code duplication when specifying exceptions, exception signals (reset,
error situations) should be created and added to the sensitivity list of the "wait"
statement. Thus, the exception signals are checked immediately after this
"wait" statement.

4. Only one state at a time is active in this state machine. Furthermore, due to the
filtering performed by the QUEUE_ENTITY entity, only one interaction is
activated at a time. The resulting state machine should be placed inside a
VHDL process, enabling the repeated execution of the state machine. This state
machine will remain in stand-by until a signal change occurs.

2.2 The AMICAL Environment 

The AMICAL environment used for the HLS was based on a structured design 
methodology (Park, 1993) to allow the use of the hierarchy concepts in the 
synthesis process (Kission, 1995). 

The synthesis process starts with two types of information: a behavioral 
specification given in VHDL and an external library of functional units. In this 
description, each entity is composed of only one process. This process may include
complex sub-systems by using procedure and function calls implementing in this 
way the concept of hierarchy among modules. However, for each procedure or 
function used, the library should contain at least one functional unit able to execute 
the corresponding operation. 

From the point of view of the HLS, a structured design is composed of a system 
and a set of components, described at the behavioral level. At the behavioral level, 
a component is described by a VHDL entity that, in the AMICAL environment, is 
invoked through either procedure or function calls. A component may correspond 
to a design produced by external tools or to a sub-system originating from a previous
design. The abstract concept of behavioral component was used to allow re-use. 

As each sub-system must be translated into an AMICAL compatible VHDL 
input description, the following restrictions apply: 

• The VHDL "entity" is restricted to the declaration of I/O signals.

• AMICAL does not accept the following types: physical, floating point,
enumeration, arrays and records.

• Operators are permitted under the assumption that there are
functions (functional units) able to execute them.

• The VHDL description may use only "wait until ..." statements.

• The "wait for time" statement is in fact ignored by AMICAL. To account for
time delays, a possible solution consists in adding a clock signal to start the
timer and waiting for the time-out signal.

• Each architecture must contain only one concurrent statement and this
statement must be a "process ..." statement.

• It is not possible to use a signal assignment with a time expression.
Consequently, one should first activate the timer, use a "wait" statement
associated with a time-out signal and then use the signal assignment.

Further restrictions apply to the use of the following Estelle signal types:

• real, record, pointer and alias types and recursion are not allowed;

• only one-dimensional arrays are allowed.
The above rules enable the generation of feasible Estelle-VHDL mappings,
which means that for each Estelle** specification there exists a semantically
equivalent description in VHDL* that is accepted by the AMICAL HLS tool. This
does not imply a one-to-one correspondence between the subsets of each language.
In fact, there are many possible VHDL translations for the same Estelle
specification. The choice of a suitable VHDL description style for protocols is an
important issue when HLS is the goal, since it directly affects the structure of
the synthesized hardware (Bhasker, 1990), as will be seen in the following
discussion.

3 DESCRIPTION STYLES FOR PROTOCOLS IN VHDL 

A protocol may generally be described by many different VHDL behavioral 
description styles. These styles provide semantically equivalent descriptions that 
differ only in the quality of the HLS result. The quality of this HLS output 
structural description is measured by the number of states and paths generated by 
the scheduling process performed by the HLS tool. It will be shown that the quality 
of the HLS output description will depend on the type of statements used and on 
the order in which they appear in the input description. 

To illustrate this fact, consider a protocol modelled by a state machine that
performs handshaking operations with another state machine, as shown in Figure
3. This generic protocol has two phases: the initialization phase and the protocol
processing phase. The latter consists of a loop that waits for an input signal,
processes the input data and outputs the corresponding result through a signal. In the
processing phase one may further distinguish three steps: data input, treatment and
output, assuming no data dependencies in the model.

This model will be used to illustrate the several description styles and their
consequences on the HLS results. The main tasks involved in HLS are scheduling
and allocation. The analysis will be restricted to the scheduling step, which is
responsible for the complexity of the controller.

The First Description style for implementing a state machine associates a VHDL
"wait until condition" statement with each branch of an Estelle "case state" statement.
The Second Description style places the "wait until input_condition" statement
outside the "case" statement. The Third employs an "if ... then ... elsif ... end if"
structure instead of a "case state" statement inside a "process" declaration. This
construct uses the state as the outer variable. A "wait until input_condition"
statement is placed outside the nested if clauses. The Fourth also employs an "if
... then ... elsif ... end if" structure but with the input as the outer signal. The Fifth
alternative for implementing a state machine is a variation of the second one; the
different output control signals are set to zero immediately after the "case
state" statement. Figure 3 shows the VHDL description and the corresponding
CFG resulting from the Fifth style. Table 2 shows the state table resulting from the
scheduling for this description using the Dynamic Loop Scheduling algorithm.
 1  process
 2  begin
 3    -- E1
 4    MEF: loop
 5      proximo <= '1';
 6      wait until inputx = '1' or inputy = '1';
 7      proximo <= '0';
 8      case state is
 9        when statex =>
10          if inputx = '1' then
11            -- E2
12            outx <= '1';
13          end if;
14        when statey =>
15          if inputy = '1' then
16            -- E3
17            outy <= '1';
18          end if;
19      end case;
20      wait until rising_edge(clk);
21      outx <= '0';
22      outy <= '0';
23    end loop;
24  end process;



Figure 3 Control flow graph and VHDL process for the Fifth Description Style

Table 2 Transition Table of the Fifth Description

Transition  State  Operation  Next  Condition
1           S1     3,5        S1    -
2           S2     7,11,12    S2    state = statex and inputx = 1
3           S2     7          S2    state = statex and inputy = 1
4           S2     7,16,17    S2    state /= statex and inputy = 1
5           S2     7          S2    state /= statex and inputx = 1
6           S2     ...        S2    (inputx /= 1) or (inputy /= 1)
7           S3     21,22,5    S2    ...



3.1 Example of Application 

The five description styles discussed previously were used, as an example, to
describe the Signaling state machine of the ABRACADABRA_HS protocol
defined in (Pirmez, 1995). The resulting descriptions were synthesized by
AMICAL using the DLS algorithm. The DLS algorithm is optimized (Rahmouni,
1994) for the treatment of control-flow dominated descriptions. It is based on the
same principles as path scheduling (Camposano, 1991), but significantly reduces
the number of generated paths and hence the computation cost. The scheduling
results obtained by the DLS algorithm, for each description style, are displayed in
Table 3.

The following information is displayed in the table: the number of paths, the 
number of states generated by each description, the number of operations, the 
number of waits, the number of ifs, the number of cases, the number of VHDL
lines of code in the input description and an estimate of the area of the controller, 
in number of transistors. 

Table 3 DLS Results

Description  Path  States  Operations  Wait  Ifs  Case  Lines  Area
First        614   88      400         54    34   1     653    206.183
Second       138   70      406         45    42   1     622    46.473
Third        136   69      413         44    51   0     625    45.904
Fourth       137   69      451         44    52   1     616    46.237
Fifth        104   36      325         11    42   1     557    35.641



It can be observed from Table 3 that, overall, the Fifth Description presented the
best results. It significantly reduced the number of "wait until input_condition" and
the number of "wait until rising_edge(clk)" statements. As a consequence, the
number of paths generated is reduced and, hence, the computational cost. The DLS
algorithm generates a new path each time a "wait until ..." statement is found; thus,
the number of waits is the main cause of path generation in the algorithm. For
this reason the First Description style presents the worst results. The other
description styles significantly reduce the number of "wait until input_condition"
statements, considerably reducing the generation of paths. The Second and Third
Description styles produce similar results since the "if ... then ... elsif ... end if"
and "case ..." statements have, in fact, similar structures. For this particular
example the results show similar behavior for the Third and Fourth styles. In
general, however, one can expect better results from the Third style when the
protocol processing is more dependent on the number of states than on the number
of inputs, reducing, accordingly, the number of generated paths and operations. In
the opposite case, one should expect better performance from the Fourth style.
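
As a worked check of this comparison, the relative savings of the Fifth style over the First can be computed directly from the Table 3 entries. The helper below is only an illustrative sketch; the function name reduction_percent is not from the paper. Applied to the table values it gives roughly 83% fewer paths (614 to 104), 59% fewer states (88 to 36), 80% fewer waits (54 to 11) and about 83% less estimated controller area (206.183 to 35.641).

```cpp
#include <cassert>

// Relative reduction, in percent, when a metric drops from `before` to `after`.
// Only the ratio matters, so the result does not depend on the unit of the metric.
double reduction_percent(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

For example, reduction_percent(614, 104) evaluates the path reduction between the First and Fifth rows.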

As a consequence of the previous discussion, to produce efficient protocol 
descriptions for HLS, the following general procedure should be adopted : 

• The FSM protocol model corresponds to a VHDL process. The process is
composed of two loops: an external loop that models the restart of the process
and an internal loop that models the FSM of the protocol.

• The "wait until ..." statement is combined with the inner loop where the control
signals are checked.

• In order to avoid duplication in the code when specifying exceptions, the
exception signals (reset, error situations) should be created and added to the
sensitivity list of the wait statement. Thus, the exception signals are
checked immediately after this wait statement.

• A state machine can be built using a "case ..." or an "if ... then ... elsif ..."
statement. In both solutions, the decision to test the state or the input first
depends on the processing, that is, on whether it is more dependent on the state
than on the input or vice versa.

• After a "case ..." or an outer "if ... then ... elsif ... end if" statement, the
output control signals should be set to zero.
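
The structure recommended above can be illustrated by a small software model of one pass through the inner loop. This C++ sketch is not the authors' code: the type names, the step() function and the state transitions between X and Y are assumptions made for illustration, and a function call stands in for the VHDL "wait until" statement.

```cpp
#include <cassert>

enum class State { X, Y };

struct Inputs  { bool reset; bool inputx; bool inputy; };
struct Outputs { bool outx = false; bool outy = false; };

struct ProtocolFsm {
    State state = State::X;
    Outputs out;

    // One pass of the inner loop. Returns false while the "wait" is still
    // pending, that is, when neither an input nor an exception is present.
    bool step(const Inputs& in) {
        out = Outputs{};  // outputs were set to zero after the previous case
        if (!(in.reset || in.inputx || in.inputy)) return false;

        // The exception signal is checked immediately after the wait.
        if (in.reset) { state = State::X; return true; }

        switch (state) {  // state is tested first, inputs second
        case State::X:
            if (in.inputx) { out.outx = true; state = State::Y; }  // assumed transition
            break;
        case State::Y:
            if (in.inputy) { out.outy = true; state = State::X; }  // assumed transition
            break;
        }
        assert(!(out.outx && out.outy));  // at most one output per pass
        return true;
    }
};
```

Placing the single wait and the output reset outside the case mirrors the Fifth description style; the outer restart loop would simply construct a fresh ProtocolFsm.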



4 FINAL CONSIDERATIONS 

A formal description technique such as Estelle is a natural choice for the protocol
designer when specifying and analyzing communication protocols. On the other
hand, VHDL is the best choice for dealing with run-time problems and interfacing
with synthesis tools. This work presented a translation strategy for HLS using the
VHDL-based AMICAL environment. Initially, the Estelle constructs were
analyzed with respect to their equivalence to VHDL. A simplified version of Estelle,
Estelle**, was proposed to simplify the translation to VHDL.

Five alternative VHDL description styles for FSMs that can be used to describe
a protocol, and the constraints that determine the most suitable style for synthesis
tools, were discussed.

Finally, the proposed methodology was applied to an example, the high
performance ABRACADABRA_HS protocol. The conclusions drawn from the
ABRACADABRA_HS protocol example can be extended to any system modelled
by an FSM.
Abstract

We present Matisse, a concurrent object-oriented system specification lan- 
guage, well-suited for protocol processing applications used in telecom net- 
works. An industrial application used in ATM networks is introduced. From 
this case study, we derive the requirements that must be supported by Ma- 
tisse. Matisse is the entry point for the methodology presented in [6], that 
bridges the gap between system specification and synthesis tools commercially 
available. In contrast to the system specification languages currently used in 
industry, Matisse is implementation-independent and permits the exploration 
of different embedded hardware/software realizations. 



Keywords 

System specification, hardware/software codesign, protocol processing appli- 
cations, system synthesis and object-oriented languages. 



1 INTRODUCTION 

Modern telecom systems are rapidly increasing in design complexity. Telecom 
network applications include protocol processing systems for broadband net- 
works [9], wireless infrastructures, and interactive video-on-demand servers. 
Currently, protocol processing applications are partitioned in hardware and 
software components that are designed separately, which often introduces 
specification and implementation mismatches that are only detected at the 
final design stages. System integration and test phases can take nearly 50% 
of the complete design cycle for typical protocol processing applications.
Software components are usually specified using SDL [15]. C or C++ code is
then generated and compiled to machine instructions for the target processor.
Run-time support is added for managing concurrency and interprocess
communication [18]. Hardware components are specified at the register transfer
level, using VHDL or Verilog, where detailed clock cycles and specific
architectural decisions are already fixed. This level of specification is often hard to
read, modify, and reuse. Small changes at the system level often require very
substantial changes in the component specifications.

There are several approaches focusing on system design for embedded
hardware/software. The hierarchical FSM model is a powerful formalism for
reactive control behaviors, but it does not support abstract data structures
and object-oriented features well. Heterogeneous design environments, like
Ptolemy [3] and CoWare [1, 2], aim to provide an open environment to inte- 
grate different models of computation. Most system-level research and CAD 
innovations today are focussed on Digital Signal Processing (DSP) applica- 
tions (e.g. [3, 10, 11]). Commercial tools include SPW from Alta/Cadence, 
COSSAP from Synopsys and DSP Station from Mentor. 

In contrast to DSP applications, protocol processing applications are dif- 
ferent in nature. In particular, protocol processing applications require ma- 
nipulation of complex data structures that are often dynamically created and 
destroyed at run time, as opposed to the signal flow present in DSP applica- 
tions. DSP models are not well-suited for control-dominated data processing 
behaviors found in protocol processing applications that heavily rely on tight 
interactions between control-flow algorithms and stored data structures. Due 
to many differences in nature between these application domains, system mod- 
els should be domain-specific. 

Distributed programming languages have been proposed for programming 
general-purpose multiprocessor systems or distributed networks of worksta- 
tions [4, 5, 12, 16]. While their underlying models are related to our Ma- 
tisse model, their implementation targets are different: they rely on elaborate 
run-time environments and are intended for pure software implementations. 
In contrast, our implementation target is intended for optimized embedded
single-chip hardware/software realizations. 

Protocol processing systems are extremely complex and they must be mod- 
eled, debugged and simulated at a high level of abstraction before proceeding 
to implementation. In this paper, we present Matisse, a system specification 
language that supports high level specification of protocol processing appli- 
cations. The remainder of this paper is organized as follows. In Section 2, 
we present an actual application example in order to derive the requirements 
that must be supported by our system specification language. In Section 3, 
we describe the Matisse language in a detailed way. In Section 4, we overview 
how Matisse fits in a hardware/software codesign flow and how Matisse can be 
used for design exploration. Finally in Section 5, conclusions are presented. 



2 SYSTEM SPECIFICATION 

This section presents an actual protocol processing application used in telecom 
networks. This case study is used to derive requirements to be supported by 
the Matisse language. 
2.1 Case study

ATM [14] is a fast packet-switching transfer mode that supports high-speed 
integrated services by splitting all communication messages into equal 53- 
byte cells, called ATM cells. These cells can carry any kind of information, be 
it computer data, video, or voice. By using small cells to transfer data, the 
technology enables networks to support a wide variety of traffic, ranging from 
high to low bandwidths, and from bursty to steady bit rates.

One representative case study is a user-transparent connectionless router
called Alcatel Connectionless Transport Server (ACTS) [17] that provides the 
necessary functions for the direct provision and support of data communi- 
cation between geographically distributed computers or between LANs over 
an ATM based broadband network. In its current implementation, the ACTS 
consists of several boards, each one consisting of several processors and co- 
processors (implemented as custom ASICs) and a programmable supervising 
microprocessor for executive control. 

A concrete example of one of those ASICs, named Segment Protocol Pro- 
cessor (SPP), is used to demonstrate the requirements to be supported by the 
Matisse language. The algorithms, implementing the SPP functionality, make 
use of stored complex data structures, shown in Figure 1. The right-hand side 
of the figure shows a FIFO, where incoming user cells are buffered. Packet 
records are accessed through two levels of tables with the local (LID) and 
multiplexing (MID) identifiers. A packet record contains various fields, such 
as the number of cells received so far, the time the first cell was received and 
a pointer to a list of routing records. 
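
The two-level access to packet records can be sketched as follows. This C++ fragment is only an illustration under assumptions: the class PacketTable, its methods and the field names are hypothetical, the table sizes are arbitrary, and the contents of a routing record are not specified in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RoutingRecord;  // contents not described in the text

// Packet record with the fields named in the text (field names assumed).
struct PacketRecord {
    unsigned cells_received = 0;          // number of cells received so far
    unsigned long first_cell_time = 0;    // time the first cell was received
    RoutingRecord* routing = nullptr;     // head of the list of routing records
};

// Two-level table: first indexed by LID, then by MID.
class PacketTable {
public:
    PacketTable(std::size_t lids, std::size_t mids)
        : table_(lids, std::vector<PacketRecord*>(mids, nullptr)) {}

    // Returns the record for (lid, mid), or nullptr if none is bound.
    PacketRecord* lookup(std::size_t lid, std::size_t mid) const {
        if (lid >= table_.size() || mid >= table_[lid].size()) return nullptr;
        return table_[lid][mid];
    }

    // Binds a record to (lid, mid); returns false if the indices are out of range.
    bool bind(std::size_t lid, std::size_t mid, PacketRecord* rec) {
        assert(rec != nullptr);
        if (lid >= table_.size() || mid >= table_[lid].size()) return false;
        table_[lid][mid] = rec;
        return true;
    }

private:
    std::vector<std::vector<PacketRecord*>> table_;
};
```

A task handling the first cell of a packet would call bind() with a newly allocated record, while later cells of the same packet would find it through lookup().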

These algorithms can be described as a set of tasks that cooperate with 
each other through the shared data structures shown in Figure 1. The tasks 
performed by the SPP are now briefly described: 

1. Data In processing: Process an incoming user cell. Look up the packet
record, or allocate a new one if the cell is the first cell of a packet. Perform 
various checks and update various fields in the packet record. Store the user 
cell in the FIFO. 




Figure 1 SPP stored data structures 
2. Ingress Screening Request (ISR) generation: Check the residency
time of the cell in the FIFO. If the residency time is larger than a software 
defined threshold, generate an ingress screening request for the coprocessor. 
Ingress screening is needed for virtual private networks. It consists of checking 
whether the destination subscriber is a member of the closed group that the 
source subscriber is in. The processor which performs the ingress screening is 
also responsible for sending a routing request to a router in the network. 

3. Routing Reply (RR) processing: Input a routing reply, look up the
packet record for which the routing information is meant, generate a routing 
record for the packet record, and update the state of the packet in the packet 
record. 

4. Data Out processing: Check the residency time of the cell in the FIFO.
If the residency time is larger than a certain threshold, forward the cell on the 
network. If the cell is the last cell of a packet, deallocate the packet record. 

5. Time Out processing: This task is necessary to deal with situations where
the last cell of a packet has been discarded somewhere in the network, so that 
the packet record is never deallocated. The task consists in inspecting the next 
packet record in the memory region allocated for packet records, checking the 
lifetime, and deallocating the record if the lifetime exceeds some threshold. 

6. MID deallocation request generation: Read the next packet in the
packet record deallocation list, and generate a MID deallocation request for 
the packet, so that reserved bandwidth can be deallocated at the destination. 
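
As an illustration of how one of these tasks reduces to a short sequential routine, the Time Out processing task can be sketched in C++. The names PacketSlot and TimeOutTask, the use of a vector for the packet record memory region and the measurement of lifetime from the arrival of the first cell are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of the memory region reserved for packet records (assumed layout).
struct PacketSlot {
    bool allocated = false;
    unsigned long first_cell_time = 0;  // time the first cell was received
};

class TimeOutTask {
public:
    TimeOutTask(std::vector<PacketSlot>& region, unsigned long max_lifetime)
        : region_(region), max_lifetime_(max_lifetime) {}

    // Inspects the next record in the region; returns true if it was deallocated.
    bool step(unsigned long now) {
        if (region_.empty()) return false;
        PacketSlot& slot = region_[next_];
        next_ = (next_ + 1) % region_.size();  // wrap around the region
        if (!slot.allocated) return false;
        assert(now >= slot.first_cell_time);   // time is assumed monotonic
        if (now - slot.first_cell_time > max_lifetime_) {
            slot.allocated = false;            // lifetime exceeded: deallocate
            return true;
        }
        return false;
    }

private:
    std::vector<PacketSlot>& region_;
    unsigned long max_lifetime_;
    std::size_t next_ = 0;
};
```

Each call to step() handles a single record, so the task can be interleaved with the other tasks or run concurrently with them on the shared region.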

Other ASICs used in the ACTS, such as the Packet Handler Processor
and the Preventive Congestion Control processor, may be described in a sim- 
ilar way, by means of a set of cooperative tasks operating on shared data 
structures. 



2.2 Requirements 

The requirements to specify protocol processing applications such as the case 
study previously described are now presented. 

Protocol processing applications are data and memory intensive systems. 
They are conceptually seen as sets of concurrent tasks for accessing data. 
Therefore data have to be considered as stored objects from the beginning. 
Concurrency tends to be at the task level and is usually coarse to medium 
scale. However in today’s design practice, due to a lack of a concurrent system 
specification language and an appropriate system design flow, conceptually
concurrent tasks are implemented as interleaved consecutive tasks, without 
exploiting the nature of the concurrency itself. 

Although the target implementation of a protocol processing application 
is often a mixture of software and hardware processors, protocol processing 
applications are best-suited to be conceived at the top level from a software 
perspective, so that advantages such as fast simulation and earlier design
validation can be obtained. Control constructs, such as if-then-else, for and while
loops, are essential for capturing the algorithmic behavior of each task. Each
of the SPP tasks may be described as a sequential program using the usual 
control constructs available in languages such as C++. 

While an object-oriented model is not necessary, it has proven successful
in the software design community, and it also plays a central role
in large scale hardware/software system design. Object-oriented languages
support data abstraction, encapsulation, polymorphism, function overloading
and inheritance, which are invaluable features in
any large scale development. With these abstraction facilities, implementation
decisions and low-level specification details can be hidden or easily updated,
allowing easy and fast design exploration. For instance, shared data structures
may be initially specified as abstract data types that will be refined later in
the design flow.

Concurrent object-oriented models are well-suited to model protocol pro- 
cessing applications, since they are intended to model concurrent computa- 
tions to be executed on more than one processor. Objects can encapsulate 
tasks as well as (shared) data structures. Remote procedure calls can encap- 
sulate interprocess communications. The system specification language must 
also: reflect the conceptual partitioning of the system, seen as a set of concur- 
rent tasks for accessing data, be independent from the final implementation, 
be manipulatable to permit easy updating and efficient design exploration, 
and be easily retargetable to different embedded hardware/software realiza- 
tions. This contrasts with current system specification practices, which are 
using VHDL for specifying the hardware processors, and C/C++ for specify- 
ing the software processors. 



3 MATISSE LANGUAGE 

From the previous requirements, we decided to follow an object-oriented ap- 
proach for the system specification of protocol processing applications. There- 
fore, the Matisse language is extended from the widely used object-oriented 
programming language C++. We introduce minimal syntactic extensions to 
C++ to allow the description of concurrent tasks, communication and syn- 
chronization among them. Compatibility with C++ enables new users already 
familiar with C++ to be productive in a very short amount of time. Also, ex- 
isting debugging and compiling tools can be easily adapted for early functional 
validation of the system specification. Finally, this enables us to leverage on 
the wide corpus of existing software compilation and runtime support tools 
for our software implementation path. This is important since software imple- 
mentation represents a substantial part of many of our target applications. 

There are many concurrent object-oriented programming languages ex- 
tended from C++. All of them are intended for specifying systems consisting 
of concurrent programs, running on a network of workstations. Compilers 
for those languages generate C++ programs with calls to an elaborate run-time
Operating System (OS) designed for software processors only. Matisse is
intended for specifying systems at the chip level instead. These systems
consist of concurrent processes, running on a mixture of embedded software and
hardware processors, each one with its own ultra-light OS. To be efficiently
implementable in both hardware and software processors, these OSs should
offer only minimum support for task scheduling, interprocessor communication
and synchronization.

More precisely, the Matisse language uses some of the high-level abstrac- 
tions existing in Compositional C++ (CC++) [5]. CC++ is a concurrent 
object-oriented language extended from C++ using only a few new keywords. 
Simplifications were brought, taking into account the requirements introduced 
in Section 2.2. Also systems specified with Matisse must be synthesized into 
a hardware/software codesign at the chip level.

CC++ allows the user to specify concurrency at all levels, from fine grain to 
task level concurrency. In protocol processing applications, the user needs to 
specify concurrency only at the task level. Thus, Matisse allows the
specification of concurrency only at this level. In CC++, the thread and local virtual
memory space concepts are separate. To model tasks that are created only at compile
time, Matisse allows active objects to be created at compile time. These objects
encapsulate together a local virtual memory space and a default thread of
control that is initiated at the creation of the active object. Due to these
restrictions, run-time support can be greatly reduced.

Similarly to CC++, communication between tasks is abstracted, without 
explicit specification of communication channels and an RPC mechanism is 
used to implement it. In CC++, data may be remotely accessed directly. In 
Matisse, data inside an active object are remotely accessed only through a ref- 
erence to the active object itself. Due to the simplified communication mech- 
anism, instead of providing two synchronization mechanisms as in CC++, 
Matisse only needs one synchronization mechanism. 

Now the different concepts in Matisse are explained in more detail. The 
concepts are illustrated using simplified code for the SPP. 



3.1 Passive and active classes 

In Matisse, two types of classes are distinguished: active and passive. 

A passive class is identical to a C++ class. For example, the packet record 
of the SPP application is a Matisse passive class declared as follows: 

class packet_record {
    int field1;
    boolean field2;
    packet_record* next;
};
Instances of a passive class are called passive objects, and are identical to
C++ objects. Passive objects may be created and destroyed either at compile 
time or at run time. 

An active class in the SPP application is the "Data In processing" task and
it is declared as follows:



active class data_in {
    cell_record* cell;
    packet_record* packet;
public:
    data_in();
    void body(packet_record_mngr *global pr,
              cell_fifo_mngr *global cf,
              input *global input) {
        cell = input->get();           // get a cell from the input
        switch (cell->type()) {
        case BOM:                      // cell is Begin Of Message
            packet = pr->alloc();      // create a new packet record
            // SOME BEHAVIOR
            pr->put(packet);           // store packet info
            cf->enqueue(cell);         // store cell in the fifo
            break;
        case COM:                      // cell is Continuation Of Message
            packet = pr->get();        // use an existing packet record
            // SOME BEHAVIOR
            pr->put(packet);           // store packet info
            cf->enqueue(cell);         // store cell in the fifo
            break;
        }
    }
};



Any instance of an active class is called an active object. An active class is 
identical to a passive class, except that each active object has its own local 
virtual memory space and may have its own default thread of control. This 
thread is then initiated at the creation of the active object. It is specified by 
a special public member function of the active class, called body. 

In contrast to passive objects, active objects may only be created at compile 
time, to avoid creation of new threads at run time that may be difficult to 
implement on a hardware processor. 

Active classes can inherit from base active classes too, and the usual C++ 
protection mechanisms apply. So private data elements and member functions 
of an active class can be used only by its own member functions. Public data
elements and member functions constitute the interface to the active objects 
of the active class. 



3.2 Concurrency at the task level 

In a typical Matisse program, the number of active objects is small, compared 
to the number of passive objects. The passive objects exist as data elements 
of active objects. Active objects are only created in the main function, which 
yields the concurrent initiation of their bodies. Hence the main function
provides the task-level concurrent structure of the Matisse program.

In a simplified SPP version, the main function first creates four active ob- 
jects and then initiates their bodies that will run concurrently: 

int main(int argc, char** argv) {
    packet_record_mngr* global pr;   // shared data
    cell_fifo_mngr* global cf;       // shared data
    data_in* global di;              // task
    data_out* global dout;           // task

    pr = activenew packet_record_mngr();
    cf = activenew cell_fifo_mngr();
    di = activenew data_in(pr, cf);
    dout = activenew data_out(pr, cf);
}

Currently we restrict the structure of the main function, which may only
consist of creating a set of active objects. The keyword activenew, taken from
UC++ [13], is used to specify the creation of an active object. The semantics
of activenew are: create an active object using the C++ new operator, execute
the constructor of each active object, and then start all the body member
functions of the active objects concurrently. Note that the 'bodies' are
wrapped in an infinite loop.



3.3 Communication 

Accessing data elements within an active object is regarded as local and hence 
cheap. A thread executing in an active object can access its data elements 
directly, by using C++ pointers. Active objects can be accessed by each other 
using global pointers. Except for their potentially higher cost of use, global 
pointers are used just like C++ pointers. 

Inside a thread, computation can be executed in another active object via 
a Remote Procedure Call (RPC), as follows: 

X *global gp;
result = gp->p(a,b,c);

where gp is a global pointer to an active object of the active class X, 
p(a,b,c) is a call to a member function p() defined in the active object refer- 
enced by gp, and result is a variable set to the value returned by p(a,b,c). 

An RPC proceeds in three stages. First, the arguments of the function p() are
packed into a message, communicated to the remote active object and unpacked,
while the calling thread suspends execution. Then a new thread is created in
the remote active object to execute the called function. Finally, upon
termination of the remote function, the function return value is transferred
back to the calling thread, which resumes execution.
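
The message flow of these three stages can be traced in plain C++. The sketch below is a sequential simulation only: the names Message, Reply, RemoteObject and rpc_call are hypothetical, the remote function simply adds its arguments, and no thread is actually suspended or created as a Matisse implementation would do.

```cpp
#include <cassert>
#include <vector>

// Hypothetical message formats; the encoding used by Matisse is not given in the text.
struct Message { int function_id; std::vector<int> args; };
struct Reply   { int value; };

// Stand-in for a remote active object offering one member function p(a, b, c).
struct RemoteObject {
    int p(int a, int b, int c) { return a + b + c; }  // placeholder computation

    // Stage 2: unpack the arguments and execute the called function.
    Reply serve(const Message& m) {
        assert(m.function_id == 0 && m.args.size() == 3);
        return Reply{p(m.args[0], m.args[1], m.args[2])};
    }
};

// Caller side: stage 1 packs the arguments, stage 3 hands back the result.
int rpc_call(RemoteObject& target, int a, int b, int c) {
    Message request{0, {a, b, c}};        // stage 1: pack (the caller would suspend here)
    Reply reply = target.serve(request);  // stage 2: executed in the remote object
    return reply.value;                   // stage 3: value returned, caller resumes
}
```

With a RemoteObject named obj, rpc_call(obj, 1, 2, 3) plays the role of result = gp->p(a,b,c) in the fragment above.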

Data can be communicated between active objects, by using global pointers,
as follows:



int len1;
A *global gp;       // A is an active class with data element len2
len1 = gp->len2;    // reading data from another active object
gp->len2 = 5;       // writing data to another active object

Reading and writing through global pointers must be used with caution, since
each access involves two communications: one to send the read or write request,
and one to return a result or a completion signal.



3.4 Synchronization 

Due to concurrent computations, several accesses to data elements or member 
functions in an active object can occur simultaneously. Matisse provides one 
method for controlling the order in which things happen, by using atomic 
functions. 

The use of atomic functions is illustrated in the following example: 



active class packet_record_mngr {
    packet_record *head, *tail;
public:
    packet_record_mngr();               // initialize head and tail
    atomic packet_record* alloc();      // create a new packet record
    atomic packet_record* get();        // get a packet record from the list
    atomic void put(packet_record*);    // put a packet record in the list
};



Whenever several threads are calling an atomic function, this atomic func- 
tion is executed the required number of times in a sequential order. Also 
the execution of an atomic function never interleaves with the execution of 
another atomic function of the same active object. This concept of atomic 
function is based on the monitor concept introduced by Hoare in [8]. 
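
The monitor behaviour can be approximated in standard C++ with one lock per object. This is a sketch under stated assumptions, not the Matisse implementation: std::mutex is swapped in for the atomic keyword, and the names PacketRecord and PacketRecordManager, as well as the deque that stands in for the head/tail list, are hypothetical.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

struct PacketRecord { int id; };

// Emulates an active class whose atomic member functions exclude each other.
class PacketRecordManager {
    std::mutex monitor_;               // one lock per object, as in a monitor
    std::deque<PacketRecord*> list_;   // stands in for the head/tail list
public:
    // Each "atomic" function holds the object's lock for its whole body.
    void put(PacketRecord* r) {
        std::lock_guard<std::mutex> guard(monitor_);
        assert(r != nullptr);
        list_.push_back(r);
    }
    PacketRecord* get() {
        std::lock_guard<std::mutex> guard(monitor_);
        if (list_.empty()) return nullptr;   // nothing queued
        PacketRecord* r = list_.front();
        list_.pop_front();
        return r;
    }
};
```

Neither function calls the other while holding the lock. With a non-recursive std::mutex such a nested call would block forever, which is one way to picture why an atomic function should not call other atomic functions of its class.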

In order to avoid deadlocks, some rules for defining atomic functions must 
be followed, such as: the body of an atomic function must terminate in a 
finite time, implying that it may not do an RPC, that it may not call other 
atomic functions of its class, and that a body function running for ever may 
not be declared atomic. Member functions may be declared atomic in both 
active and passive classes. 

Using atomic functions instead of atomic objects helps the user to spec- 
ify critical sections that must be as short as possible. In an object-oriented 
approach, each object (either active or passive) is responsible for its own pro- 
tection. In Matisse, this is still valid, but deciding which member functions 
have to be declared atomic is currently left to the user.



3.5 Shared Data structures

Each active object represents one local virtual memory space with a default
thread executing in it. Passive objects are never shared between active objects.
If a passive object needs to be shared by several active objects, the user has
several options. The user can, for example, specify typical active objects with
no body function and whose data elements are those passive objects to be shared.
In this way, the active object is a kind of memory manager for passive objects.
For instance, in the main function of our example, introduced in Section 3.2,
pr is such an active object: it is an instance of the class packet_record_mngr,
declared in Section 3.4, and it is used to manage packet records.



4 SYSTEM DESIGN FLOW 

The Matisse language, described previously, is used as input to the system de- 
sign flow [6] depicted in Figure 2. Matisse is also used for functional validation 
and it facilitates design exploration due to its high level of abstraction. 

The system design flow starts from an initial concurrent object-oriented
specification within the Matisse language and targets a heterogeneous
implementation of software and hardware processors. The Matisse program, using
abstract data types such as sets, collections of data, and association tables,
specifies the system to be described. This program can be executed, allowing
functional validation and debugging.

The Matisse program is internally represented as a network of communicating
processing objects, managed by an ultra-light OS. This internal representation
allows efficient system design exploration and it is still independent from
the final HW/SW realization. Refinement and optimization consist of: refining
abstract data types into efficient complex data structures [20], generating
memory management of these complex data structures, which are dynamically
allocated and deallocated by concurrent processes [7], optimizing memory
access, yielding an ordering of concurrent threads [19], and exploring
different embedded HW/SW realizations, based on interprocess communication
costs.

Figure 2 Matisse Design Flow

System architecture generation consists in allocating a number of hardware 
and software physical processors and mapping the internal representation of 
Matisse into the target architecture. Communications between active objects 
assigned to the same physical processors are refined into intraprocessor com- 
munications. Communications between active objects assigned to different 
physical processors are refined into interprocessor communications. 

Software processor synthesis consists in generating the complete specifica- 
tion of each software processor, so that synthesis is made possible by us- 
ing traditional software design tools for code generation. Hardware processor 
synthesis consists of memory synthesis, that generates a distributed shared 
memory architecture, and VHDL code generation, so that synthesis is made 
possible by using traditional hardware/behavioral synthesis tools. CoWare 
[1, 2] is used for interprocessor communication synthesis. 

5 CONCLUSION 

In this paper, we have addressed the system specification problem for protocol 
processing applications used in telecom networks. We introduced the SPP, an 
industrial example that demonstrates the main requirements to model such 
applications at the system level. We presented Matisse, a system specification 
language extended from C++. The concepts present in Matisse were shown 
sufficient to specify the SPP. Currently, we are evaluating and refining our de- 
sign flow using this case study. At the moment, the Matisse compiler generates 
an abstract machine that can be executed using the CoWare environment. 

Using Matisse and the proposed system design flow, the user is able to: write 
a system specification, independent from the final implementation, and easily 
retargetable to different embedded hardware/software realizations; validate 
functionally the specification and explore the design space at system level. 

In the near future, we want to demonstrate the applicability of our concurrent
object-oriented approach to other real telecom applications. We are
also investigating how to include timing constraints in the system
specification and how to support them through the system design flow.



Abstract

This paper presents an efficient method for mapping a set of Boolean equations 
onto a set of Static CMOS Complex Gates (SCCGs) under a constraint in the 
number of serial transistors. This Library Free Technology Mapping (LFTM) 
approach uses a virtual library of SCCGs available through a layout generator, 
instead of using a limited set of pre-characterized cells. Our goal is to use a 
virtual library of SCCGs to perform the mapping at the transistor level, in order 
to fit the topological constraints imposed by the CMOS technology. Limitations 
of previously proposed techniques to perform Library Free Technology Mapping 
are discussed. The proposed method, based on a one-to-one association of
CMOS transistors with Binary Decision Diagram arcs, is not dependent on the
initial ordering of Boolean equations. Experimental results comparing this
technique to previously published ones indicate that it generates good-quality
solutions.



1 INTRODUCTION

The automatic synthesis of integrated circuits involves the generation, treatment 
and optimization of several intermediate descriptions at different levels of 
abstraction. At the back end of this process, it is necessary to choose the physical 
elements that will be used to implement the final layout. This task is usually 
known as technology mapping and it was pioneered in the works of Keutzer 
(1987) and Detjens (1987). Formally, technology mapping is the task of mapping 
a set of Boolean equations onto a set of given physical elements to minimize a 
total given cost function. In a typical application, the set of Boolean equations 
does not have any technology information and comes from a logic synthesis tool. 
The set of physical elements is normally a library of pre-characterized cells for a 
given target technology. Each cell is associated with some technology specific 
information such as its area, delay, power, load and drive capabilities. The 
technology mapping tool chooses a set of cells from this library in order to obtain 
a circuit equivalent to the original Boolean description. The goal of this process is
to globally minimize some of the technology dependent cost functions (area, delay
or power). Obviously, the quality of the final set of physical elements depends on
the quality of the initial set of Boolean equations and on the quality of the library, 
as well as on the ability of the mapping tool to use the cells available in the 
library. 

This paper presents an efficient method of mapping a set of Boolean equations 
onto a set of Static CMOS Complex Gates (SCCGs) under a given constraint in 
the number of serial transistors. This approach is implemented in a tool called
TABA, which is designed to work together with a leaf cell generator. Our goal
is to use a virtual library of SCCGs (available through the leaf cell generator) and 
to perform the mapping at the transistor level, in order to fit the topological 
constraints imposed by the CMOS technology. 

This paper is organized as follows: Section 2 details the formulation of the 
Library Free Technology Mapping (LFTM) problem and discusses the limitations 
of previously proposed approaches. Section 3 presents our approach to Library 
Free Technology Mapping. The validation of the proposed method is given in
Section 4, where the performances obtained by TABA on a set of benchmarks are
compared to previously published results. Finally, we conclude on the benefit of 
using LFTM and on its application to circuit design. 



2 FORMULATION OF THE PROBLEM 

LFTM relies on the extensive and efficient use of a large virtual library of SCCGs 
instead of using a pre-characterized library. Let us discuss some topics and 
present terms related to this problem.

Figure 1: SCCG(2,2) virtual library.

2.1 Static CMOS complex gates 

A Static CMOS Complex Gate (SCCG) is composed of two complementary
serial/parallel networks of transistor switches connecting the output of the SCCG
with the logic signal sources. The network between the output and the logic 1
source (pull-up network, or P plane) is composed of PMOS transistors, while the
network connecting the output to the logic 0 source (pull-down network, or N
plane) is composed of NMOS transistors. Each input of the SCCG controls a dual
pair composed of an NMOS and a PMOS transistor.

2.2 CMOS Virtual libraries 

We will denote by SCCG(n, p) the set of the SCCGs with a maximum of n serial
NMOS transistors and p serial PMOS transistors. Note that a given limitation in
the number of serial transistors induces a finite set of SCCG configurations to
be available as a virtual library. From Detjens (1987), the cardinality of the
set SCCG(4,3) is |SCCG(4,3)| = 396. The library SCCG(2,2) is composed of seven
gates: a one-input gate (inverter), two two-input gates (2-input NAND and
2-input NOR gates), two three-input gates and two four-input gates, as shown in
figure 1.
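
The finiteness of the virtual library can be checked by brute force. The following sketch is not the method used in the literature cited here: it enumerates series/parallel pull-down networks by repeatedly composing already known gates, keeping only those within the serial limits. The names Net, canonical, compose and count_sccg are hypothetical, and a gate is identified by a canonical string in which the children of each series or parallel group are sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>
#include <vector>

// One series/parallel pull-down network, stored in canonical (sorted) form.
struct Net {
    char kind;                      // 'x' = one transistor, 'S' = series, 'P' = parallel
    std::vector<std::string> kids;  // canonical strings of the sub-networks
    int serialN;                    // longest NMOS chain in the pull-down network
    int serialP;                    // longest PMOS chain in the dual pull-up network
};

static std::string canonical(const Net& n) {
    if (n.kind == 'x') return "x";
    std::string s(1, n.kind);
    s += '(';
    for (const std::string& k : n.kids) { s += k; s += ','; }
    s += ')';
    return s;
}

// Combine two networks in series ('S') or parallel ('P'), flattening nested
// groups of the same kind so that equivalent gates share one canonical string.
static Net compose(char kind, const Net& a, const Net& b) {
    Net r{kind, {}, 0, 0};
    for (const Net* n : {&a, &b}) {
        if (n->kind == kind) r.kids.insert(r.kids.end(), n->kids.begin(), n->kids.end());
        else r.kids.push_back(canonical(*n));
    }
    std::sort(r.kids.begin(), r.kids.end());
    if (kind == 'S') {  // NMOS chains add up; the dual PMOS branches are in parallel
        r.serialN = a.serialN + b.serialN;
        r.serialP = std::max(a.serialP, b.serialP);
    } else {            // and the other way round for a parallel NMOS group
        r.serialN = std::max(a.serialN, b.serialN);
        r.serialP = a.serialP + b.serialP;
    }
    return r;
}

// Number of distinct gates in SCCG(n, p), found by closing {inverter} under compose().
std::size_t count_sccg(int n, int p) {
    std::map<std::string, Net> lib;
    lib.emplace("x", Net{'x', {}, 1, 1});
    bool grew = true;
    while (grew) {
        grew = false;
        std::vector<Net> snapshot;
        for (const auto& entry : lib) snapshot.push_back(entry.second);
        for (const Net& a : snapshot)
            for (const Net& b : snapshot)
                for (char kind : {'S', 'P'}) {
                    Net r = compose(kind, a, b);
                    if (r.serialN > n || r.serialP > p) continue;
                    if (lib.emplace(canonical(r), r).second) grew = true;
                }
    }
    return lib.size();
}
```

Calling count_sccg(2, 2) gives 7, matching the seven gates listed above. The same routine can be run with larger bounds to compare against cardinalities quoted from the literature, although its running time grows quickly with n and p.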

2.3 Previous approaches to Technology Mapping 

Technology mapping requires two distinct operations: matching
and covering. Matching is the recognition of the equivalence between a portion of
the initial network and library cells. Covering is the task of choosing the best set
of cells whose interconnection represents the original circuit among the several
possibilities returned by the matching phase. Now we will discuss the limitations
of applying current techniques to perform LFTM.

Figure 2: Fan-in distribution of the SCCG(4,4) virtual library.

2.3.1 Pattern Matching 

Pattern matching techniques were used in the pioneering works of Keutzer and
Detjens. The difficulty in applying pattern matching techniques to LFTM is that
the number of patterns is greater than the number of SCCGs. DAGON applies pattern
matching on a DAG reduced to a forest of trees. The number of necessary tree
patterns is not linear in the library size: 1171 different patterns are needed
to represent a subset of SCCG(4,4) containing only 123 SCCGs (Detjens, 1987).
Pattern matching is thus limited by the size of the library.

2.3.2 Boolean Matching 

Pattern matching is unable to treat a great number of (logically) different cells.
Recently Maillot (1993) proposed a Boolean matching technique in which the
logic function of a subset of the subject description is compared with the logic
function of a library element, by using the canonicity of ROBDDs (Bryant, 1986).
Even if there is only one logic function for the element, the number of pin
assignments to match a cell is exponential in its number of inputs. In (Maillot,
1993) the libraries were limited to gates with 11 or fewer inputs and, in
practice, only gates with a maximum of 6 inputs were used. Figure 2 shows the
limitations of using only six-input cells with respect to the library SCCG(4,4).
Boolean matching is limited by the size of the cells in the library.

2.3.3 Dynamic covering 

The covering process is normally done by dynamic covering of a tree with trees.
Dynamic covering is said to give the optimal solution to the problem.
Unfortunately, this is only partially true, because the optimal solution of the
covering problem depends on the initial ordering of the tree. Figure 3 shows
optimal covers of two trees representing the same logic function (the cost
function is the number of cells from SCCG(3,3), disregarding inverters).

(a) Bad initial ordering    (b) Good initial ordering

Figure 3: Dynamic covering of a tree with trees.

2.3.4 Greedy structural covering 

The method proposed in (Abouzeid, 1992) is also based on tree representations.
However, there is no need to recognize patterns because the use of inversions is
limited to root and leaf nodes, allowing easy calculation of the number of serial
transistors. The approach is based on decomposing trees into sub-trees by locally
taking into account the number of serial transistors in the sub-trees. The method
proposed in (Berkelaar, 1988) is a similar method based on a graph
representation. The main problem of the greedy structural approaches is that only
the local limitation of serial transistors is considered while constructing an
SCCG structure.



3 LIBRARY FREE TECHNOLOGY MAPPING 

This section presents our approach to the Library Free Technology Mapping
(LFTM) problem. Sub-section 3.1 reviews some basic concepts about BDDs and
introduces TSBDDs. Our algorithm is presented in sub-section 3.2.

3.1 Binary Decision Diagram basics 

Terminal suppressed BDDs (TSBDDs) are briefly presented in (Reis, 1995). 
Bryant has introduced ROBDDs as a canonical form to represent Boolean 
functions, by proving that there is a unique correspondence between a ROBDD 
and a Boolean function for a given variable ordering (Bryant, 1986). TSBDDs are 
appropriate for CMOS technology mapping because a TSBDD will always match 
a SCCG and the transistor topology of the PMOS and NMOS transistor networks 
can be obtained by isomorphism with BDD arcs.

A Binary Decision Diagram (or BDD) is a rooted directed acyclic graph whose
vertex set is V. Each non-leaf or non-terminal vertex v has as attributes a
pointer index(v) ∈ {1, 2, ..., n} to an input variable in the set
{x1, x2, ..., xn}, and two children low(v), high(v) ∈ V. A leaf or terminal
vertex v has only one attribute, value(v) ∈ B, where B is the binary set {0,1}.
In this paper, we say that every arc whose head is v and whose tail is low(v)
is an S0 arc. Also, every arc whose head is v and whose tail is high(v) is an
S1 arc.

An OBDD is a BDD where any non-terminal vertex pair {v, low(v)} and {v, 
high(v)} obeys the variable ordering condition index(v)<index(children(v)). 

A ROBDD is an OBDD where there is no vertex v with low(v)=high(v), nor
any pair {v, u} such that the sub-graphs rooted in v and in u are isomorphic. In
order to get a ROBDD from an OBDD, it is necessary [6]:

- to eliminate all the redundant vertices whose two edges point to the same 
vertex, 

- to share all the isomorphic sub-graphs. 
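
Both reduction rules are commonly enforced at vertex creation time through a table of already built vertices. The sketch below illustrates that general idea and is not taken from TABA: the class name Robdd and its member make are hypothetical, vertices are identified by integer ids, and sharing is only guaranteed when the graph is built bottom-up through make.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

class Robdd {
public:
    struct Vertex { int index, low, high; };  // low/high are ids of other vertices

    Robdd() {
        vertices_.push_back({0, 0, 0});  // id 0: terminal vertex with value 0
        vertices_.push_back({0, 1, 1});  // id 1: terminal vertex with value 1
    }

    // Return the id of the vertex (index, low, high), applying both reduction rules.
    int make(int index, int low, int high) {
        if (low == high) return low;                       // redundant vertex: skip it
        const auto key = std::make_tuple(index, low, high);
        const auto found = unique_.find(key);
        if (found != unique_.end()) return found->second;  // isomorphic sub-graph: share it
        vertices_.push_back({index, low, high});
        const int id = static_cast<int>(vertices_.size()) - 1;
        unique_[key] = id;
        return id;
    }

    std::size_t size() const { return vertices_.size(); }

private:
    std::vector<Vertex> vertices_;
    std::map<std::tuple<int, int, int>, int> unique_;
};
```

Requesting the same (index, low, high) triple twice returns the same id, and a request whose two children coincide returns that child directly, so no redundant or duplicated vertex is ever stored.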

Now let us see the properties of the two main data structures used in this work,
the VPBDDs (Vertex Precedent BDDs) and the TSBDDs. The two main differences
with respect to BDDs are that VPBDD and TSBDD vertices can have negative
indexes to represent negated (inverted) variables and that they have a vertex
precedence index vp(v) to represent the order in which the vertices appear in
the graph.

A VPBDD (Vertex Precedent BDD) is a BDD where:

1) there is a path that passes through all non-terminal vertices, i.e. either
vp(low(v))=vp(v)+1 or vp(high(v))=vp(v)+1. This condition is called vertex
precedence.

2) there is no vertex v with low(v)=high(v).

3) the graph is serial/parallel, i.e. for each node v, the nodes av above v
(defined by vp(av)<vp(v)) can have neither vp(low(av)) nor vp(high(av)) inside
the interval ]vp(v), max(vp(low(v)), vp(high(v)))[.

A TSBDD is defined as a VPBDD that has the following properties:

1) Only S0 arcs connect the 1-terminal vertex, and they are suppressed without
losing information.

2) Only S1 arcs connect the 0-terminal vertex, and they are suppressed without
losing information.

3) All arcs having the same tail are either S0 arcs or S1 arcs.

4) There is a path that passes through all non-terminal vertices (vertex precedence
condition, as for VPBDDs).

5) The graph is serial/parallel, as for VPBDDs.

In short, if a VPBDD obeys the previous conditions, then it is a TSBDD
and it is possible to associate CMOS transistors directly (one-to-one association)
with its arcs to obtain SCCGs [7], as shown in figure 4.

Figure 4: Generation of an SCCG from a TSBDD of the function
F1=A.(B+C)(D+E).



3.1.3 Implementation issues 

The BDD class is implemented in a hierarchical way. As the BDDs used here are
serial/parallel graphs, the data structure is composed of a virtual class triangle
having two derived classes: BDD_node and BDD. These classes use a common
interface given by the triangle class. A BDD is composed of a list of triangles, a
parallel node pointed to by all triangles in the list and a serial node pointed
to by one of the triangles of the list. At each level the sons are not allowed to
have the same parallel node as their father. If this situation happens, the son is
deleted and replaced by its sons (and one hierarchical level disappears).

3.2 Technology Mapping in TABA 

Our LFTM tool (TABA - Transistor Assignment with TSBDDs representing
Static CMOS Complex Gates) is presented here. Figure 5 shows the steps of the
algorithm through an example. The input is a Boolean network composed of
simple gates (5.a) that is decomposed in terms of logic cones (5.b) where the
gates are interconnected with fan-out 1. For each logic cone, three steps are
performed. First, a TSBDD representing the logic cone is constructed (5.c). Next,
the number of serial transistors is limited, taking into account polarity
assignment (5.d). Finally, the SCCGs are generated by association of
transistors with TSBDD arcs (5.e). The final result is a CMOS network (5.f)
containing SCCGs obeying the serial transistor constraint used in step 5.d. Next
we will discuss these steps in more detail.

Figure 5: Steps of TABA technology mapping.



3.2.1 VPBDD construction 

The first step in TSBDD construction is the creation of a TSBDD for each cell in 
the initial network. As the initial network contains only simple gates, this 
operation is trivial. Afterwards, for every two combinational cells interconnected 
with fan-out 1, the following procedure is performed: 1) The two cells are 
deleted. 2) A new VPBDD is created by collapsing the two VPBDDs of the cells. 
3) A new cell is created to encapsulate the VPBDD resulting from step 2. This 
procedure is repeated till no more combinational cells interconnected with fan- 
out 1 are found. The resulting structure is a set of VPBDDs interconnected with 
fan-out greater than one (otherwise they would be collapsed). In order to obtain a
network of SCCGs under a constraint of serial transistors, two operations are
needed. First, the number of serial arcs must be limited to respect the constraint of
serial transistors. Second, the polarity assignment must transform the VPBDDs into
TSBDDs, in order to obtain a description that is realizable in CMOS technology.
The quality of the final network will depend on the number of cuts inserted to 
limit serial arcs of the original VPBDD, as well as on the number of inverters 
used to obtain a polarity assignment realizable in CMOS technology. 





Library free technology mapping 






3.2.2 Limitation of serial transistors 

Fortunately, the number of serial arcs obtained when transforming a VPBDD into a
TSBDD does not change with the chosen polarity. Polarity assignment will only
change the plane where the transistors will be in serial/parallel. When using
libraries that are closed under inversion, like SCCG(2,2), SCCG(3,3) and
SCCG(4,4), these two problems can be treated separately. This sub-section
introduces an algorithm that minimizes the number of cuts inserted to limit the
number of serial arcs, while the next sub-section deals with polarity
assignment.

The algorithm for serial transistor limitation is based on the way the number of
serial arcs is calculated in a VPBDD. The VPBDD is hierarchical and there are
two costs per level, CS and CP, which represent the number of serial and parallel
transistors, respectively. The final costs are the CS and CP of the upper level.
Leaf nodes have CS and CP set to 1.

The costs of non-leaf levels are calculated as follows:

$$CS_{father} = \sum_{sons} CP_{son}$$

$$CP_{father} = \max_{sons} CS_{son}$$
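
The two cost rules lend themselves to a short recursive computation. The sketch below assumes a plain tree in which a level without sons is a leaf; the names Level, Cost and cost_of are hypothetical and do not come from TABA.

```cpp
#include <algorithm>
#include <vector>

// One hierarchical level; a level with no sons is a leaf (a single transistor).
struct Level {
    std::vector<Level> sons;
};

struct Cost { int cs; int cp; };

Cost cost_of(const Level& level) {
    if (level.sons.empty()) return Cost{1, 1};  // leaf nodes: CS = CP = 1
    Cost c{0, 0};
    for (const Level& son : level.sons) {
        const Cost s = cost_of(son);
        c.cs += s.cp;                  // CS of the father: sum of the CP of its sons
        c.cp = std::max(c.cp, s.cs);   // CP of the father: maximum CS of its sons
    }
    return c;
}
```

For a top level with three sons, one leaf and two groups of two leaves each (the shape of A.(B+C)(D+E)), this yields CS = 3 and CP = 2: each group has CS = 2 and CP = 1, so the father sums three CP values of 1 and takes the maximum CS of 2.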

The serial transistor limitation is performed one level at a time. The algorithm
descends until it finds a level where CS is greater than the serial transistor
constraint and where all sons have CS and CP within the constraint. Then the
algorithm chooses a subset of the sons whose sum of CPs is the largest
obtainable value that does not exceed the serial transistor constraint. This
subset is then replaced by a single node. The extracted subset obeys the serial
transistor constraint, and the process is repeated over the remaining VPBDD
until it also obeys the serial transistor constraint. As the sons can be
selected in any order, the algorithm does not depend on the initial order of the
Boolean equation, as happens in dynamic covering.
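
The text does not say how the subset of sons is found. One possible way, shown here purely as an assumption, is a small subset-sum dynamic program over the CP values of the sons; the function name pick_sons is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the sons whose CP values sum to the largest total not exceeding limit.
std::vector<std::size_t> pick_sons(const std::vector<int>& cp, int limit) {
    assert(limit >= 0);
    std::vector<bool> reachable(limit + 1, false);
    std::vector<std::size_t> last(limit + 1, 0);  // son that first produced each sum
    reachable[0] = true;
    for (std::size_t i = 0; i < cp.size(); ++i) {
        assert(cp[i] >= 1);  // every son contributes at least one transistor
        // Descending order ensures that son i is used at most once.
        for (int s = limit; s >= cp[i]; --s) {
            if (!reachable[s] && reachable[s - cp[i]]) {
                reachable[s] = true;
                last[s] = i;
            }
        }
    }
    int best = limit;
    while (!reachable[best]) --best;  // reachable[0] guarantees termination
    std::vector<std::size_t> chosen;
    for (int s = best; s > 0; s -= cp[chosen.back()]) chosen.push_back(last[s]);
    return chosen;
}
```

With sons of CP 2, 2 and 3 and a constraint of 4, the two sons of CP 2 are selected, since their sum reaches the constraint exactly. The input order only affects which of several equally good subsets is returned, not the sum achieved.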

3.2.3 Polarity assignment and SCCG generation 

The first step of the polarity assignment algorithm is the assignment of a cost for 
each net in the network. Nets with high fan-out have a zero cost (because they 
will need buffer insertion, so the cost of getting the inverted signal is zero). The 
polarity of each cell is set in order to eliminate every inverter inside the logic 
cones, because the nets inside the logic cone are not charged nets. At the end of 
the process, the inverters will be at the borders of the logic cone, that are charged 
nets or inputs of the circuit, or both. If the cost of the cone is high, the polarity of 
every cell is changed and this will lead to a polarity assignment with lower cost 
and without inverters inside the cone. 

3.2.4 Design Environment 

TABA is conceived to be integrated in a design flow composed of a logic 
synthesis tool at the front end and a layout generator at the back end. At the 
moment, we are using SIS as the logic synthesis tool and TROPIC (Moraes, 
1993) as the layout synthesis tool.

Table 1: Comparison between Berkelaar (1988) and TABA

             results from Berkelaar     results from TABA(4,4)
Circuit      trans   SCCGs   invs       trans   SCCGs   invs
5xp1           302      31      8         302      25     26
9sym           620      63     22         404      30     35
bw             532      63     22         516      59     39
duke2         1224     142     53        1110     131     92
misex1         132      14      5         148      13     12
rd53           120       8      1          82      11      6
rd73           274      24     10         174      19     16
rd84           384      40     18         290      30     25
sao2           486      53     22         362      32     33
Total         4074                       3388
%Total        100%                        83%


4 VALIDATION AND RESULTS 

Comparative results between TABA and (Berkelaar, 1988) are shown in table 1.
For a set of nine IWLS93 benchmarks mapped on SCCG(4,4), TABA reduces the
number of transistors by an average of 17%. In the best case, TABA wins by
36% and in the worst case TABA loses by 12%. The comparisons are only
presented for SCCG(4,4) because this is the only SCCG set used for the
results presented in (Berkelaar, 1988).

Table 2 shows comparative results between TABA and (Abouzeid, 1992).
These benchmarks are first mapped on SCCG(3,3), because the results from
(Abouzeid, 1992) use only this set. In the best case, TABA wins by 70% and in
the worst case TABA loses by 8%. Again, TABA reduces the number of transistors
by an average of 17% when compared to (Abouzeid, 1992). This average rises to
24% when TABA maps onto SCCG(4,4) instead.

Table 2: Comparison between Abouzeid (1992) and TABA

             Abouzeid           TABA(3,3)          TABA(4,4)
Circuit      trans   SCCGs      trans   SCCGs      trans   SCCGs
5xp1           408      49        326      31        302      25
9sym           714      77        448      41        404      30
bw             510      64        552      68        516      59
misex1         212      30        164      17        148      13
rd53           182      22         86      12         82      11
rd73           576      63        186      22        174      19
rd84          1020     110        302      33        290      30
sao2           504      59        406      43        362      32
hyeti1        9714    1003       8704     889       7880     683
hyeti2        6500     685       5740     602       5208     469
Total        20340              16914              15366
%Total        100%                83%                76%


5 CONCLUSION 

The Library Free Technology Mapping problem was introduced and discussed. 
We have presented a method for mapping a set of Boolean equations into a set of 
Static CMOS Complex Gates (SCCGs) based on a new class of Binary Decision 
Diagrams, the TSBDDs. This method was implemented in a tool called TABA 
and it has two main features. First, it exploits the re-ordering of the structure to 
obtain a minimal number of complex gates. Second, it takes into account polarity 
assignment at early stages and inverter minimization is carried out together with 
buffer insertion and fan-out limitation. Experimental results have demonstrated 
that this method gives a better reduction in the overall number of transistors when 
compared to previously published results for Library Free Technology Mapping.



Abstract

This paper addresses the problem of binary decision diagram (BDD) minimization
in the presence of don’t care sets. Specifically, given an incompletely
specified function g and a fixed ordering of the variables, we propose an
exact algorithm for selecting f such that f is a cover for g and the binary
decision diagram for f is of minimum size. We proved that this problem is
NP-complete. Here we show that the BDD minimization problem can be formulated
as a binate covering problem and solved using implicit enumeration
techniques similar to the ones used in the reduction of incompletely specified
finite state machines.

Keywords: Logic Synthesis, Binary Decision Diagrams, Finite State 
Machines. 



1 INTRODUCTION 

A completely specified Boolean function f is a cover for an incompletely
specified function g if the value of f agrees with the value of g for all the
points in the input space where g is specified. This paper describes an exact
algorithm for selecting f such that f is a cover for g and the binary decision
diagram (BDD) for f has a minimum number of nodes (complemented edges
are not considered here). For a given ordering of the variables, the BDD for
f is unique (Bryant 1986) and the problem has a well defined solution. This
problem was proved NP-complete (Oliveira, Carloni, Villa & Vincentelli 1996)
using Takenaga & Yajima’s (1993) result that the problem of identification of
the minimum BDD consistent with a set of minterms is NP-complete.
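
The cover relation can be stated concretely on truth tables. In the sketch below, which is only an illustration with a hypothetical function name is_cover, both functions are given as strings with one character per minterm, and '-' marks a point where g is unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// g: one character per minterm, '0', '1', or '-' for a don't care point.
// f: one character per minterm, '0' or '1'.
bool is_cover(const std::string& f, const std::string& g) {
    assert(f.size() == g.size());
    for (std::size_t m = 0; m < g.size(); ++m) {
        if (g[m] != '-' && f[m] != g[m]) return false;  // disagrees where g is specified
    }
    return true;
}
```

For g = "1-0-", both "1100" and "1001" are covers, while "0100" is not because it differs from g on the first minterm. The minimization problem is to pick, among all such covers, one whose BDD has the fewest nodes.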

We show that this minimization problem can be solved by selecting a minimum
sized cover for a graph that satisfies some additional closure conditions.
In particular, we show that the minimum sized binary decision diagram com- 
patible with the specification can be found by solving a covering problem that 
is very similar to the covering problem obtained using exact algorithms for 
the reduction of incompletely specified finite state machines (ISFSM) (Kam, 
Villa, Brayton & Vincentelli 1994). This similarity makes it possible to use 
implicit enumeration techniques developed for the purpose of ISFSM reduc- 
tion (Kam et al. 1994) to solve efficiently the BDD minimization problem. The 
representation with ROBDDs (Brace, Rudell & Bryant 1990) of the charac- 
teristic function of the sets of compatibles and prime compatibles allows the 
generation of very large sets that cannot be enumerated explicitly. 

The transformation presented in this paper and the algorithms developed 
for the solution are important for a variety of reasons. In applications of 
inductive learning that use BDDs as the representation scheme (Oliveira & 
Vincentelli 1996), the accuracy of the inferred hypotheses is strongly
dependent on the complexity of the result. The selection of the minimum BDD
consistent with an incompletely specified function is also important in logic
synthesis applications that derive gate-level implementations from a BDD,
such as timed Shannon circuits, DCVS trees and multiplexer-based FPGAs.

Several heuristic algorithms have been proposed for this problem. The re- 
strict (Coudert, Berthet & Madre 1989) and the constrain operators are two 
heuristics commonly used to assign the don’t cares of a BDD. A compre- 
hensive study of heuristic BDD minimization has been presented by Shiple, 
Hojati, Vincentelli & Brayton (1994). 

An exact algorithm (Ranjan, Shiple & Hojati 1993) based on the enumer- 
ation of the different covers that can be obtained by all possible assignments 
of the don’t care points has also been proposed. A pruning technique reduces 
the enumeration process thanks to the result that changing the value of a
function f of n variables on a minterm m cannot change the size of the BDD for
f by more than n nodes. Although this pruning is performed implicitly, the
method is exponential in the number of don’t care points, and therefore is
not applicable to problems of non-trivial size.



2 DEFINITIONS 

We use the standard notation for BDDs. A BDD is a rooted, directed, acyclic 
graph where each node is labeled with the name of one variable. A BDD 
is called reduced if no two nodes exist that branch exactly in the same way 
and no redundant nodes exist. A BDD is ordered if there is an ordering of 
the variables such that, for all possible paths in the graph, the variables are 
always tested in that order. 

The level of a node $n_i$, $\mathcal{L}(n_i)$, is the index of the variable tested at that node
under the specific ordering used. The level of a function h, $\mathcal{L}(h)$, is defined as
the level of a BDD node that implements h. If $n_i$ is a node in the BDD and
m a minterm, $n_i(m)$ will be used to denote both the value of function $n_i$ for
minterm m and the terminal node that m reaches when starting at $n_i$.

A 3 Terminal BDD (3TBDD) is defined in the same way as a BDD in
all respects except that it has three terminal nodes: $n_z$, $n_o$ and $n_x$, which
correspond to the zero, one and undefined terminal nodes. A 3TBDD F
corresponds to the incompletely specified function f that has all minterms in
$f_{off}$, $f_{on}$ and $f_{dc}$ terminate in $n_z$, $n_o$ and $n_x$, respectively.



3 THE COMPATIBILITY GRAPH 

Previous algorithms (Ranjan et al. 1993) for this problem used directly the
BDD representation of $f_{on}$ and $f_{off}$. The exact approach described in this
paper uses the 3TBDD F that corresponds to the incompletely specified function
f. F is assumed to be complete. If necessary, F is made complete by adding
extra nodes that have the then and else edges pointing to the same node. In 
general, the resulting 3TBDD is no longer reduced. Moreover, we suppose that 
the 3TBDD does not use complemented edges. The definition of the algorithm 
is based on the lemmas and definitions that follow. Due to space limitations, 
proofs for all the lemmas are omitted here and can be found in Oliveira et al. 
(1996). 

Definition 1 Two nodes $n_i$ and $n_j$ in F are compatible ($n_i \sim n_j$) iff no
minterm m exists that satisfies $n_i(m) = n_z \wedge n_j(m) = n_o$ or
$n_i(m) = n_o \wedge n_j(m) = n_z$.

This definition implies that $n_z$ and $n_o$ are not compatible with each other
and that $n_x$ is compatible with any node in a 3TBDD.
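
Definition 1 can be checked directly when the functions of two nodes are small enough to tabulate. The sketch below is only an illustration: it assumes each node function is given as a Python dict from minterm strings to '0', '1' or 'x' (the undefined value). This representation and the name `compatible` are assumptions, not part of the paper.

```python
def compatible(fa, fb):
    """Definition 1: no minterm is sent to 0 by one function and to 1 by the other."""
    return not any({fa[m], fb[m]} == {"0", "1"} for m in fa)


# Hypothetical one-variable node functions (minterm -> value).
na = {"0": "x", "1": "0"}         # undefined on 0, zero on 1
nb = {"0": "0", "1": "1"}         # completely specified
zero = {"0": "0", "1": "0"}       # function of the zero terminal
undefined = {"0": "x", "1": "x"}  # function of the undefined terminal

print(compatible(na, nb))         # False: the two disagree on minterm 1
print(compatible(na, zero))       # True
print(compatible(nb, undefined))  # True: the undefined terminal never conflicts
```
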

Definition 2 Two nodes $n_i$ and $n_j$ in F are common support compatible
($n_i \approx n_j$) iff there exists a completely specified function h such that
$h \sim n_i$ and $h \sim n_j$ and $\mathcal{L}(h) \geq \max(\mathcal{L}(n_i), \mathcal{L}(n_j))$.

The definition implies that $n_z \not\approx n_o$ and $n_x \approx n_i$, for any node $n_i$.

It is important, at this point, to understand the relationship between these 
two concepts. First, note that the completely specified function h referred in 
Definition 2 does not necessarily correspond to any node in F. In fact, in 
most cases, h will not correspond to any node in F, since most nodes in F 
correspond to incompletely specified functions. 

The relationship between compatibility and common support compatibility 
(CSC) is given by the following lemma: 

Lemma 1 If $n_i \approx n_j$ then $n_i \sim n_j$.

The reverse implication of Lemma 1 is not true, in general. Two nodes may
be compatible but not CSC, as shown by the example presented by Oliveira 
et al. (1996). However, when two nodes belong to the same level, common 
support compatibility and compatibility are equivalent: 



Lemma 2 If $\mathcal{L}(n_i) = \mathcal{L}(n_j)$ then $n_i \sim n_j \Rightarrow n_i \approx n_j$.

The motivation for the definition of common support compatibility can now 
be made clear. Assume that two nodes belong to different levels and are com- 
patible. In principle, they could be replaced by a new node that implements a 
function compatible with the functions of each node. In general, this function 
may depend on variables that are not on the support of the node at the higher 
level. Assume this node is $n_j$. Later, when we try to build the reduced BDD,
edges that are incident into $n_j$ will need to go upwards, against the variable
ordering of the BDD. On the other hand, if both nodes are common support
compatible, then they can be replaced by a node that implements the com- 
pletely specified function h referred to in Definition 2. Because this function 
only depends on the variables common to the supports of both nodes, this 
problem will not arise. 

The concept of common-support compatibility can be extended to sets of 
nodes in the natural way: 

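Definition 3 The nodes in the set $s_i = \{n_1, n_2, \ldots, n_s\}$ are common support
compatible iff there exists a completely specified function h such that
$h \sim n_j$ for $j = 1, \ldots, s$ and $\mathcal{L}(h) \geq \mathcal{L}_{\max}(s_i)$, where $\mathcal{L}_{\max}(s_i)$ is the
maximum level of the nodes in $s_i$.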

Definition 4 A set of nodes that are common support compatible is called a 
compatible set or, simply, a compatible. 

The definition of a compatible implies that any two nodes that belong to a 
compatible are pairwise common support compatible. The reverse implication 
is not true, but the next lemma holds. 

Lemma 3 Let $s_i$ be a set of nodes belonging to the same level. Then, $s_i$ is a
compatible iff all nodes in $s_i$ are pairwise common support compatible.

Definition 5 The compatibility graph, G = (V, E), is an undirected graph
that contains the information about which nodes in F can be merged. Except
for the terminal node $n_x$, each node $n_i$ in F corresponds to one node $g_i$ in V
with the same index. The level of a node in G is the same as the level of
the corresponding node in F. Similarly, $g_i^{else}$ and $g_i^{then}$ are the nodes that
correspond to $n_i^{else}$ and $n_i^{then}$.

Graph G is built in such a way that if nodes $n_i$ and $n_j$ are common support
compatible then there exists an edge between $g_i$ and $g_j$. An edge may have
labels. A label is a set of nodes that expresses the following requirement: if
nodes $g_i$ and $g_j$ are to be merged, then the nodes in the label also need to be
merged. There are three types of labels: e, t and l labels. The following two
lemmas justify the algorithm by which graph G is built:



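Lemma 4 If $\mathcal{L}(n_i) = \mathcal{L}(n_j)$ then $n_i \approx n_j \Rightarrow (n_i^{else} \approx n_j^{else} \wedge n_i^{then} \approx n_j^{then})$.

Lemma 5 If $\mathcal{L}(n_i) < \mathcal{L}(n_j)$ then $n_i \approx n_j \Rightarrow (n_i^{else} \approx n_j \wedge n_i^{then} \approx n_j)$.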



The previous two lemmas justify the following algorithm to build the com- 
patibility graph. 



Algorithm 1 

1. Initialize G with a complete graph except for edge $(g_z, g_o)$, which is removed.

2. If $\mathcal{L}(g_i) = \mathcal{L}(g_j)$ then the edge between $g_i$ and $g_j$ has two labels: an e label
with $\{g_i^{else}, g_j^{else}\}$ and a t label with $\{g_i^{then}, g_j^{then}\}$. (By Lemma 4.)

3. If $\mathcal{L}(g_i) < \mathcal{L}(g_j)$ then edge $(g_i, g_j)$ has an l label with $\{g_i^{else}, g_i^{then}, g_j\}$. (By
Lemma 5.)

4. For all pairs of nodes $(g_i, g_j)$, check if the edge between nodes $g_i$ and $g_j$ has
a label that contains $\{g_a, g_b\}$ and there is no edge between $g_a$ and $g_b$. If so,
remove the edge between $g_i$ and $g_j$. Repeat this step until no more changes
take place.
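
The four steps of Algorithm 1 can be sketched on a small 3TBDD as follows. Everything concrete here is an assumption made for illustration: the node names and the dict encoding, the placement of terminals at a level one past the last variable, the treatment of the undefined terminal (written "x") as mergeable with anything, and the contents of the l label, taken to be the two children of the lower-level node together with the other endpoint. The encoded function is the three-variable example with on-set {011, 111} and off-set {010, 110, 101}.

```python
from itertools import combinations

# Assumed encoding of a complete 3TBDD: node -> (level, else child, then child).
# Terminals are "z" (zero), "o" (one) and "x" (undefined); names are illustrative.
F = {
    "g0": (1, "g1", "g2"),
    "g1": (2, "g5", "g4"),
    "g2": (2, "g3", "g4"),
    "g3": (3, "x", "z"),
    "g4": (3, "z", "o"),
    "g5": (3, "x", "x"),  # node added only to make F complete
}
TERMINAL_LEVEL = 4  # assumption: terminals sit one level past the last variable


def level(g):
    return F[g][0] if g in F else TERMINAL_LEVEL


def labels(a, b):
    """Labels of edge (a, b), each a set of nodes that must be mergeable."""
    if level(a) > level(b):
        a, b = b, a
    if a not in F:  # both endpoints are terminals: no labels
        return []
    if level(a) == level(b):  # step 2: e label and t label
        return [{F[a][1], F[b][1]}, {F[a][2], F[b][2]}]
    return [{F[a][1], F[a][2], b}]  # step 3: l label (assumed contents)


def build_compatibility_graph():
    vertices = list(F) + ["z", "o"]  # the undefined terminal has no vertex
    edges = {frozenset(p) for p in combinations(vertices, 2)}
    edges.discard(frozenset(("z", "o")))  # step 1

    def satisfied(label):
        nodes = sorted(label - {"x"})  # "x" never blocks a merge
        return all(frozenset(p) in edges for p in combinations(nodes, 2))

    changed = True
    while changed:  # step 4: iterate to a fixed point
        changed = False
        for edge in list(edges):
            a, b = tuple(edge)
            if not all(satisfied(lab) for lab in labels(a, b)):
                edges.discard(edge)
                changed = True
    return edges


G = build_compatibility_graph()
for edge in sorted(tuple(sorted(e)) for e in G):
    print(edge)
```

Edges are only ever removed, so the loop reaches the same fixed point regardless of the order in which edges are visited.
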

Figure 1 shows an example of the 3TBDD F obtained from f defined by the
following sets: $f_{on}$ = {011, 111}, $f_{off}$ = {010, 110, 101} and the corresponding
compatibility graph.

The existence of an edge in the compatibility graph is related to common
support compatibility and to compatibility between pairs of nodes in
the following way:

Lemma 6 $n_i \approx n_j \Rightarrow \exists e \in E$ s.t. $e = (g_i, g_j) \Rightarrow n_i \sim n_j$.

It is important to note that the reverse implications are not true. In particu- 
lar, the existence of an edge between two nodes in G does not imply that they 
are common support compatible, as it is possible to have an edge between two 
nodes in G that are not CSC. 

Figure 1 The 3TBDD F and the compatibility graph G. Nodes and Qx 
are not shown in the compatibility graph, since they are common support 
compatible with every node in the graph. 



4 CLOSED CLIQUE COVERS 

A clique of graph G is a completely connected subgraph of G. To any set s
of nodes that is a clique of G there are associated class sets. If the nodes in s
are to be merged into one, the nodes in its class sets are also required to be in
the same set. Let $s_i$ be a set of nodes that form a clique in
G. The following are the definitions of the e, t and l classes of $s_i$. Notice that
for concision we may blur the distinction between the nodes g of G and the
corresponding nodes n of F. Strictly speaking, cliques are defined on sets of
g’s and compatibles on sets of n’s.

Definition 6 The e (t) class of $s_i$, $C_e(s_i)$ ($C_t(s_i)$), is the set of nodes that are in some
e (t) label of an edge between a node $g_j$ and $g_k$ in $s_i$ with
$\mathcal{L}(n_k) = \mathcal{L}(n_j) = \mathcal{L}_{\max}(s_i)$.

Definition 7 The l class of $s_i$, $C_l(s_i)$, is the set of nodes that are in some l
label of an edge between a node $g_j$ and $g_k$ in $s_i$ with $\mathcal{L}(g_j) < \mathcal{L}(g_k)$.

Lemma 7 If a set $s_i$ of nodes is a clique of G and $C_l(s_i) \subseteq s_i$ then $s_i$ is a
compatible set.

Note that a clique of G that does not satisfy the condition in Lemma 7 
is not necessarily a compatible set. The algorithm that selects the minimum 
BDD compatible with the original function works by selecting nodes of G that 
can be merged into one node in the final BDD. If a set s of nodes in G is to 
be merged into one, the set s has to be a compatible set. Therefore, it has 
to be a clique of G satisfying Definition 7. The objective is to find a set of 
cliques such that every node in G is covered by at least one clique. However, 
to obtain a valid solution, some extra conditions need to be imposed. 
Definition 8 A set $S = \{s_1, s_2, \ldots, s_n\}$ of sets of nodes in G is called a closed
clique cover for G if the following conditions are satisfied:

1. S covers G: $\forall g_i \in G \; \exists s_j \in S : g_i \in s_j$.

2. All $s_k$ are cliques of G: $\forall g_i, g_j \in s_k : (g_i, g_j) \in edges(G)$.

3. S is closed with respect to the e and t labels:

$\forall s_i \in S \; \exists s_j \in S : C_e(s_i) \subseteq s_j \;\wedge\; \forall s_i \in S \; \exists s_j \in S : C_t(s_i) \subseteq s_j$.

4. All sets in S are closed with respect to the l labels: $\forall s_i \in S : C_l(s_i) \subseteq s_i$.
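
The four conditions of Definition 8 translate into a short explicit check. The sketch below is a reading of the definition under stated assumptions, not the implicit algorithm used in the paper: the 3TBDD encoding and node names are hypothetical, the edge set of G is simply taken as given, the e and t classes are computed as the else and then children of the highest-level nodes of a set, the l class as the children of all lower-level nodes, and the undefined terminal "x" is ignored in every class.

```python
from itertools import combinations

# Assumed 3TBDD encoding: node -> (level, else child, then child).
F = {
    "g0": (1, "g1", "g2"), "g1": (2, "g5", "g4"), "g2": (2, "g3", "g4"),
    "g3": (3, "x", "z"), "g4": (3, "z", "o"), "g5": (3, "x", "x"),
}
TERMINAL_LEVEL = 4
# Edges of the compatibility graph for this example, taken as given here.
EDGES = {frozenset(e) for e in [
    ("g0", "g1"), ("g0", "g2"), ("g1", "g2"), ("g1", "g4"), ("g1", "g5"),
    ("g3", "g5"), ("g4", "g5"), ("g3", "z"), ("g5", "z"), ("g5", "o"),
]}


def level(g):
    return F[g][0] if g in F else TERMINAL_LEVEL


def is_clique(s):
    return all(frozenset(p) in EDGES for p in combinations(sorted(s), 2))


def child_class(s, which):
    """e class (which=1) or t class (which=2) from the highest-level nodes of s."""
    top = max(level(g) for g in s)
    return {F[g][which] for g in s if g in F and level(g) == top} - {"x"}


def l_class(s):
    """Children of every node of s that lies above the highest level in s."""
    top = max(level(g) for g in s)
    return {c for g in s if level(g) < top for c in F[g][1:]} - {"x"}


def is_closed_clique_cover(S):
    vertices = set(F) | {"z", "o"}
    covers = vertices <= set().union(*S)                          # condition 1
    cliques = all(is_clique(s) for s in S)                        # condition 2
    closed_et = all(any(child_class(s, w) <= t for t in S)
                    for s in S for w in (1, 2))                   # condition 3
    closed_l = all(l_class(s) <= s for s in S)                    # condition 4
    return covers and cliques and closed_et and closed_l


good = [{"g0", "g1", "g2"}, {"g4"}, {"g3", "g5", "z"}, {"o"}]
bad = [{"g0", "g1", "g2"}, {"g3", "g4"}, {"g5", "z"}, {"o"}]
print(is_closed_clique_cover(good))  # True
print(is_closed_clique_cover(bad))   # False: {g3, g4} is not a clique
```
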



5 GENERATION OF A MINIMUM BDD 

From a closed clique cover for G, a reduced BDD R is obtained by the following 
algorithm: 

Algorithm 2 

1. For each $s_i$ in S, create a BDD node in R, $r_i$, at level $\mathcal{L}_{\max}(s_i)$.

2. Let the nodes in R that correspond to sets $s_i$ containing nodes that
correspond to terminal nodes in F be the new corresponding terminal nodes of
R.

3. Let the else edge of the node $r_i$ go to the node $r_j$ that corresponds to a set
$s_j$ such that $C_e(s_i) \subseteq s_j$.

4. Let the then edge of the node $r_i$ go to the node $r_j$ that corresponds to a set
$s_j$ such that $C_t(s_i) \subseteq s_j$.

Lemma 8 R is an ordered BDD compatible with F.

Now, the main result follows. Let B be the set of all BDDs that represent 
functions compatible with the incompletely specified function f. Then:

Theorem 1 The BDD induced by a minimum closed cover for G is the BDD 
in B with minimum number of nodes. 

As an example, $S = \{\{g_0, g_1, g_2\}, \{g_4\}, \{g_3, g_5, g_z\}, \{g_o\}\}$ is a closed cover
for the example depicted in Figure 1 and induces the BDD R shown on the
right side of Figure 2.

The definition of a closed cover is very similar to the standard definition of 
a closed cover used in the minimization of FSMs. If the graph of a 3TBDD 
is viewed as the state transition graph of an FSM, the algorithms developed 
for the minimization of FSMs can be used with some modifications. The two 
important differences to consider are:

Figure 2 The 3TBDD F, the compatibility graph G and a solution R. Node
$g_5$ was arbitrarily included in compatible $\{g_3, g_5, g_z\}$.



1. The definition of the e and t classes and the closure requirement in point 
3 of Definition 8 are different from the definitions used in standard FSM
minimization. In BDD minimization, only nodes at the highest level in some 
compatible define the e and t classes, while in standard FSM minimization 
all nodes in a compatible set are involved in the definition of these classes. 

2. The requirement in point 4 of Definition 8 means that some sets of nodes 
that satisfy the definition of a compatible set in the FSM case do not satisfy 
the conditions for BDD minimization. 

These two changes can be incorporated into existing algorithms for FSM 
minimization. In particular, the closure conditions with respect to the e and 
t labels are similar to the closure conditions imposed in standard FSM mini- 
mization. The restriction imposed by condition 4 in Definition 8 simply elim- 
inates some cliques of the compatibility graph from consideration and can be 
implemented by a filtering step. The transformation from BDD minimization 
to FSM reduction and its correctness are shown in Oliveira et al. (1996). 



6 IMPLICIT COMPUTATION OF A MINIMUM CLOSED 
COVER 

We will use the unified implicit framework proposed by Kam et al. (1994).
Implicit techniques are based on the idea of operating on discrete sets by 
their characteristic functions represented by BDDs (Bryant 1986). 

To perform state minimization, one needs to represent and manipulate
efficiently sets of sets of states. With n states, each subset of states is represented
in positional-set form, using a set of n Boolean variables, $x = x_1 x_2 \ldots x_n$.
The presence of a state $s_k$ in the set is denoted by the fact that variable $x_k$
takes the value 1 in the positional set, whereas $x_k$ takes the value 0 if state
$s_k$ is not a member of the set. For example, if n = 6, the set with a single
state $s_4$ is represented by 000100 while the set of states $s_2 s_3 s_5$ is represented
by 011010.
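
The positional-set encoding can be reproduced with a few lines of Python. This is only an explicit illustration of the notation: the function name `positional` is an assumption, and states are identified by their 1-based index.

```python
def positional(states, n):
    """Positional-set string: character k (1-based) is '1' iff state k is in the set."""
    return "".join("1" if k in states else "0" for k in range(1, n + 1))


print(positional({4}, 6))        # 000100
print(positional({2, 3, 5}, 6))  # 011010
```
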

A set of sets of states S is represented in positional notation by a
characteristic function $\chi_S : B^n \to B$ as: $\chi_S(x) = 1$ if and only if the set of states
represented by the positional set x is in the set S. A BDD representing $\chi_S(x)$
will contain minterms, each corresponding to a state set in S. As an example,
$Tuple_{n,k}(x)$ denotes all positional sets x with exactly k states in them (i.e.
|x| = k). For instance, the set of singleton states is $Tuple_{n,1}(x)$. An alternative
notation for $Tuple_{n,k}(x)$ is $Tuple_k(x)$.
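
For small n, the set described by the Tuple characteristic function can be enumerated explicitly, which may help when reading the implicit formulation. The sketch below assumes positional sets are written as bit strings; the name `tuple_nk` is hypothetical, and the explicit enumeration is exactly what the BDD representation is meant to avoid for large n.

```python
from itertools import product


def tuple_nk(n, k):
    """All positional sets over n states that contain exactly k states."""
    return {x for x in map("".join, product("01", repeat=n)) if x.count("1") == k}


print(sorted(tuple_nk(3, 1)))  # the singleton state sets for n = 3
print(len(tuple_nk(6, 2)))     # number of two-state sets for n = 6
```
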

Any relation R between pairs of sets $S_1$ and $S_2$ can be represented by
its characteristic function $\mathcal{R}$, where $\mathcal{R}(x, y) = 1$ if and
only if $\chi_{S_1}(x) = 1$, $\chi_{S_2}(y) = 1$ and the element of $S_1$ represented by x is in
relation R with the element of $S_2$ represented by y. A similar definition holds
for relations defined over more than two sets. For example, we represent the
state transition graph (STG) of an FSM by the characteristic functions of two
relations: 1) the output relation $\Lambda$, where input i, present state p and output
o are in $\Lambda(i, p, o)$ if there is an edge from p with input/output label i/o, and
2) the next state relation $\mathcal{T}$, where input i, present state p and next
state n are in relation $\mathcal{T}(i, p, n)$ if there is an edge from p to n with input
label i.

It has been shown in Section 5 that given a BDD minimization problem it 
is possible to generate a companion FSM whose closed covers of compatibles 
correspond to closed clique covers of the BDD, if: a) FSM compatibles that 
do not satisfy the L-closure are discarded, and b) FSM compatible closure is 
replaced by E-closure and T-closure. Our starting point is the fully implicit 
algorithm for exact state minimization reported by Kam et al. (1994), to 
which we refer for a complete description of the implicit computations. In the 
sequel we discuss the modifications needed to generate closed clique covers of 
the BDD. 

Consider the set of compatibles $\mathcal{C}(c)$, where $\mathcal{C}(c) = 1$ iff c is the positional
set representing a compatible of the companion FSM. When minimizing an
FSM obtained from an instance of BDD minimization one must delete from
$\mathcal{C}(c)$ the compatibles c that are not closed with respect to their l-class. The
l-class, $C_l(c)$, of a compatible c is the set of nodes that are in some l-label of
an edge between nodes $g_j$ and $g_k$ in c with $\mathcal{L}(g_j) < \mathcal{L}(g_k)$. If $\mathcal{L}(g_j) < \mathcal{L}(g_k)$
then edge $(g_j, g_k)$ has the l-label $\{g_j^{else}, g_j^{then}, g_k\}$.

In standard FSM minimization one requires closure with respect to implied 
sets. Given a compatible c an implied set under input i is the set of next states 
from the states in c under i. Instead in the case of BDD minimization one 
must compute the implied sets only from the states in c of highest level. This 
requires a change in the computation of the relation of the implied classes
$\mathcal{F}(c, i, n)$.

The new computation for $\mathcal{F}(c, i, n)$ is described by the following equation:

$$\mathcal{F}(c, i, n) = \exists p \, \{ \exists c' \, [\mathcal{C}(c) \cdot Max\_Level(c, c') \cdot (c' \supseteq p)] \cdot \mathcal{T}(i, p, n) \}$$

Subsets of states c and c' are in relation $Max\_Level(c, c')$ iff c' is the subset
of c that contains the states of c of maximum level, i.e. the states having the
largest distance from r in the STG of the FSM.
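
The effect of restricting implied sets to the maximum-level states can be seen in an explicit version of the computation. The sketch below is not the implicit BDD-based computation: it assumes the next-state relation is a plain set of (input, present state, next state) triples and that a level is known for every state. The state names, the `top_only` switch and the small transition relation are hypothetical.

```python
def implied_set(c, i, level, T, top_only=True):
    """Next states under input i from compatible c.

    With top_only=True only the maximum-level states of c contribute (the BDD
    minimization variant); with top_only=False every state of c contributes
    (the standard FSM notion of implied set).
    """
    lmax = max(level[p] for p in c)
    sources = {p for p in c if level[p] == lmax} if top_only else set(c)
    return {n for (inp, p, n) in T if inp == i and p in sources}


# Hypothetical five-state example.
level = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 3}
T = {(0, "a", "b"), (1, "a", "c"),
     (0, "b", "d"), (1, "b", "e"),
     (0, "c", "e"), (1, "c", "e")}

compat = {"a", "b", "c"}
print(sorted(implied_set(compat, 0, level, T)))                  # ['d', 'e']
print(sorted(implied_set(compat, 0, level, T, top_only=False)))  # ['b', 'd', 'e']
```
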



7 RESULTS 

Starting from the program ISM for implicit state minimization (Kam et al. 
1994) we developed IMAGEM, a new program based on the theory described
in this paper for exact BDD minimization. To evaluate experimentally the 
algorithms presented in this paper, we assembled two sets of problems: the 
first set derives directly from a machine learning application and the second 
set was obtained from a logic synthesis benchmark. In all the problems, the 
original ordering specified for the variables was the ordering used. 

For the first set of problems, 12 completely specified Boolean functions 
fi were used as the starting point. For each of these functions, a randomly 
selected set of minterms was designated as the care set, resulting in a set of 
incompletely specified Boolean functions gi. The objective was to verify if the 
algorithm was able to identify a BDD no larger than the BDD for fi, which 
represents a known upper bound on the solution. The second set of problems 
was obtained by selecting a subset of the examples that are distributed with 
Espresso (Brayton, Hachtel, McMullen & Vincentelli 1984), a well known two- 
level minimizer. We included here the functions that are the first output from 
each of the PLAs that are included in the industry subset of the Espresso 
benchmark suite, after eliminating all the functions that have a null don’t 
care set. 

Table 1 summarizes the results obtained from running these sets of exam- 
ples. The last entry in the table is the example presented in this paper to 
illustrate the theory. 

The first four columns report the original number of states, the number 
of compatibles, the number of compatibles after filtering (i.e. the ones which 
are closed with respect to their l-class) and the number of primes. The next
two columns report the exact result obtained and the result obtained by the 
restrict operator (Coudert et al. 1989). The last column contains the time 
spent by IMAGEM to find the solution: all run times are reported in CPU
seconds on a DEC Alpha (300 MHz) with 2 GB of memory. For all experiments,
“timeout” has been set at 21600 seconds of CPU time and “spaceout” at 2 GB
of memory.
example    orig.    compat.   filtered  prime   red.     restrict  IMAGEM
           states             comp.     comp.   states             CPU time

dnfa       64       2.4e+12   1332186   89      14       16        517.29
dnfb       36       4.8e+08   2987      94      6        12        11.85
dnfc       40       2.2e+08   2613      102     10       15        12.94
dnfd       93       1.1e+20   9.5e+08   -       -        23        timeout
dnfe       63       2.1e+13   141179    509     6        12        217.29
dnff       62       2.1e+11   92027     357     15       22        151.8
xor3       9        179       14        7       6        6         0.1
xor4       17       14975     118       13      6        6         0.43
xor5       24       608255    267       36      9        10        1.37
xor6       40       3.3e+08   1329      170     13       20        13.98
xor7       57       2.7e+11   3076      640     15       31        88.16
xor8       94       1.5e+17   164929    21830   17       45        9041.11
alu1       95       1.0e+21   841993    1204    6        6         7409.97
br1        74       2.9e+18   799173    329     6        11        1313.91
br2        51       5.9e+14   53687     78      3        8         14.59
dpi        50       1.4e+13   7559      39      3        13        12.39
dc2        46       8.2e+10   8831      98      8        12        57.66
exp        54       2.6e+11   10638     25      3        3         31.34
exps       71       1.8e+10   3810      125     43       44        44.79
in0        151      2.6e+25   1.6e+06   1323    42       44        18201.76
in3        173      5.0e+39   587880    12      9        14        1755.21
inc        35       1.1e+07   364       26      12       13        3.84
intb       189      4.8e+46   3.8e+14   -       -        69        spaceout
mark1      71       7.4e+18   8049      35      4        5         41
newapla    52       1.2e+12   3252      33      10       11        41.5
newapla1   57       8.7e+14   8733      63      6        6         141.66
newapla2   19       93311     137       6       5        5         0.49
newbyte    16       20735     127       9       5        5         0.41
newcond    165      3.8e+31   7.4e+12   -       -        54        spaceout
newcpla2   39       3.3e+08   477       68      10       21        5.72
newcwp     16       10367     106       10      6        11        0.39
newtpla    94       1.2e+23   411525    148     7        23        469.14
newtpla1   39       6.9e+09   1441      31      4        5         4.45
newtpla2   26       3.1e+06   158       9       9        9         0.9
newxcpla1  39       4.4e+09   1473      35      5        10        5.13
p82        16       15551     102       10      7        7         0.4
prom1      65       5.1e+09   382       77      50       50        30.04
prom2      33       2.1e+08   446       38      12       12        3.33
sex        28       1.6e+07   419       16      5        5         1.62
spla       155      1.6e+39   1.4e+12   -       -        8         spaceout
sqn        41       1.0e+07   173       43      19       19        9.13
t4         68       5.1e+14   31775     157     9        11        89.98
vg2        150      3.6e+36   4.0e+07   -       -        14        timeout
wim        14       4319      82        8       6        6         0.26
ex. paper  10       575       32        8       4        4         0.17

Table 1 Results obtained in the sets of problems studied.



8 CONCLUSIONS

This paper addresses the problem of binary decision diagram (BDD) mini- 
mization in the presence of don’t care sets. We show that the minimum-sized 
binary decision diagram compatible with the specification can be found by 
solving a problem that is very similar to the problem of reducing an ISFSM. 
The approach described is the only known exact algorithm for this problem 
not based on the enumeration of the assignments to the points in the don’t 
care set. We show that this minimization problem can be formulated as a 
binate covering problem and solved using implicit enumeration techniques. 
We have implemented this algorithm and performed experiments, by means 
of which exact solutions for an interesting benchmark set were computed.



Abstract

This paper introduces a new logic transformation that integrates retiming
with algebraic and Boolean transformations at the technology-independent 
level. It offers an additional degree of freedom in sequential network optimiza- 
tion resulting from implicit retiming across logic blocks and fanout stems. The 
application of this transformation to sequential network synthesis results in 
the optimization of logic across register boundaries. We have implemented our 
new technique within the SIS framework and demonstrated its effectiveness 
in terms of cycle-time minimization on a set of sequential benchmark circuits. 

Keywords 

Sequential Logic Synthesis, Logic Optimization, Retiming



1 INTRODUCTION

Over the years, sequential circuit synthesis has been a subject of intensive 
investigation. Though synthesis of combinational logic has attained a significant
level of maturity, sequential circuit synthesis is lagging behind. In the current
state of affairs, sequential networks are first optimized by applying combinational
network transformations to the logic between the register boundaries,
and mapped into the gate-level network. The resulting network is then often
optimized by applying a retiming transformation [8].

Retiming is the process of relocating the registers across logic gates without 
affecting the underlying combinational logic structure. It can be used to min- 
imize cycle-time or the number of registers under the cycle-time constraint. 
While in principle retiming can be applied at various levels of synchronous
system design, it has been traditionally used as a structural transformation in 
gate-level circuit optimization. As such, gate-level retiming exploits only one 
degree of freedom in circuit optimization, namely, the relocation of registers. 
Furthermore, gate-level retiming does not take into account the prospective 
logic simplification. Potential for the optimization by subsequent re-synthesis 
is very limited, as it is typically applied to the logic between register bound- 
aries. 

In this paper we investigate the issue of retiming at the technology indepen- 
dent level. We introduce a novel and efficient approach to synthesis and op- 
timization of synchronous sequential circuits, in which retiming is performed 
implicitly during logic optimization. 

There have been several attempts to combine retiming with algebraic
network transformations in the quest to optimize the logic across register
boundaries. Peripheral retiming, introduced by Malik et al. [10], optimizes the
underlying combinational logic after a temporary relocation of registers to the
periphery of the circuit. It suffers from a limited mobility of registers during
the peripheral movement phase. DeMicheli [2] introduced the concept of
synchronous divisors and used it in local logic optimization across the register
boundaries. Lin [9] formalized the theory for synchronous extraction to detect
potential common divisors. Both methods operate on the structural
specification of a synchronous circuit, and do not take into account the
prospective logic simplification during synchronous division. Dey et al. [3] proposed
a method to improve the effectiveness of retiming by attempting to eliminate
retiming bottlenecks. Chakradhar et al. [1] introduced special timing constraints
which are used to resynthesize the circuit. The modified circuit is subsequently 
retimed, and the constraints (if satisfied by the delay optimizer) guarantee 
that the circuit is retimable and meets the desired cycle time. 

Retiming has also been used in the context of minimizing latency (rather
than clock period) in pipelined circuits. A number of papers addressed the
problem of combining retiming with architectural and structural transformations
to improve latency and/or throughput. The scheme proposed by Potkonjak
et al. [11] uses retiming to enable algebraic transformations that can further
improve latency/throughput. Hassoun et al. [5] introduced the concept of
architectural retiming, which attempts to increase the number of registers on a
latency-constrained path without increasing the overall latency. These
seemingly contradictory goals are achieved by implementing “negative” registers
using precomputation and prediction techniques. In the process, the circuit is 
structurally modified to preserve its functionality. 

Most of the techniques mentioned above operate on a structural represen- 
tation of the network. The cost function that guides retiming in network 
optimization does not take into account the potential for subsequent logic 
simplification. In contrast, our approach takes into account the effect of re- 
timing on logic simplification. It operates directly on a functional specification 
given in terms of synchronous Boolean expressions. It is an iterative synthesis
process which integrates retiming with extraction, collapsing, and node
simplification into one synchronous transformation. It efficiently handles
retiming across fanout stems while preserving the initial state. It also provides a
simple method to compute the initial state of the resynthesized circuit, consistent
with the original network specification.



2 PRELIMINARIES 

A Boolean function, f, of n variables is a mapping $f : B^n \to B$, where
B = {0, 1}. A literal is a Boolean variable or its complement. A cube is defined
as a product of literals. The support of a Boolean function is defined as a 
set of all variables that appear in the function. An expression is said to be 
cube-free when it cannot be factored by a cube. A kernel of an expression is 
a cube-free quotient of the expression divided by a cube. Extraction is the 
process of factoring out a subexpression from one or more logic functions of a 
network and creating a new node for the extracted expression. Collapsing or 
elimination is the process of (re)expressing a Boolean function representing a 
node in the logic network in terms of the support variables of its fanin node. 
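
The notions of cube-free expression and kernel can be illustrated on sum-of-products expressions stored as sets of cubes. The sketch below assumes each cube is a frozenset of literal names and adopts the usual convention that a cube-free expression has at least two cubes. The helper names and the expression abc + abd + e are illustrative assumptions.

```python
def cube(*literals):
    return frozenset(literals)


def divide_by_cube(expr, c):
    """Quotient of a sum-of-products expression by a single cube."""
    return {term - c for term in expr if c <= term}


def is_cube_free(expr):
    """True if no literal is common to all cubes (and there are at least two cubes)."""
    return len(expr) > 1 and not frozenset.intersection(*expr)


f = {cube("a", "b", "c"), cube("a", "b", "d"), cube("e")}  # abc + abd + e

q1 = divide_by_cube(f, cube("a", "b"))  # c + d
q2 = divide_by_cube(f, cube("a"))       # bc + bd

print(is_cube_free(q1))  # True: c + d is a kernel of f
print(is_cube_free(q2))  # False: bc + bd can still be factored by the cube b
```
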

Forward retiming is the operation of shifting the registers from the inputs to 
the outputs of a node in a Boolean network; backward retiming is the reverse 
operation. A node in the network can represent an arbitrary Boolean function. 
It has been shown that such a transformation preserves the behavior of the 
circuit [8]. Forward and backward retiming transformations are illustrated in 
Fig. 1 a). A node is said to be forward (backward) retimable if each of its 
input (output) edges contains a register. Retiming across a fanout stem is the
operation of forward retiming of a multiple-fanout register across its fanout 
stem. Retiming across a fanout stem imposes an equivalence relation on the 
fanout registers. All network transformations and initial state computation
must take into account this register equivalence. An expression is called a
retimable expression if all the variables in its support set are register variables. 
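
Forward retimability of a node can be illustrated on a small edge-weighted graph. The sketch below assumes the common graph model of retiming in which each edge carries a register count. The dict encoding, the function names and the three-edge network are hypothetical, and initial-state computation is not modelled.

```python
def forward_retimable(node, regs):
    """True if every input edge of `node` carries at least one register."""
    inputs = [w for (u, v), w in regs.items() if v == node]
    return bool(inputs) and all(w >= 1 for w in inputs)


def retime_forward(node, regs):
    """Move one register from each input edge of `node` to each of its output edges."""
    if not forward_retimable(node, regs):
        raise ValueError(f"{node} is not forward retimable")
    return {(u, v): w - (v == node) + (u == node) for (u, v), w in regs.items()}


# Hypothetical network: registers on both inputs of g, none on its output.
regs = {("a", "g"): 1, ("b", "g"): 1, ("g", "out"): 0}

after = retime_forward("g", regs)
print(after)                          # registers now sit on the output edge of g
print(forward_retimable("g", after))  # False: the inputs no longer carry registers
```
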




Figure 1 a) Retiming of a logic node b) Retiming across a fanout stem 

Associated with each register is a pair of variables $(R_i, r_i)$, where $R_i$ is the
input to the register and $r_i$ is its output, referred to as a register variable.







Part Eight Synthesis and Technology Mapping 



The variables $r_i$ and $R_i$ can be viewed as inputs and outputs, respectively,
of the combinational part of the sequential network, with registers providing 
feedback paths. 



3 THEORY AND ALGORITHMS 

Traditional retiming across a logic gate in a gate-level network (or across a 
node in a Boolean network) can be extended to a retiming across an arbitrary 
subexpression of the original logic function. Such a retiming, combined with 
the extraction of a suitable expression, forms the basis of our new sequential 
transformation. We shall refer to it as the logic retiming transformation, for 
lack of a better term. The following sections describe the operations involved 
in logic retiming. 



3.1 Retime Extraction 



Example 1: Consider the sequential logic network represented by the following 
equations and shown in Fig. 2a): 

$$
\begin{aligned}
O_1 &= r_1 r_2 i_1 + r_3 i_1 + i_2 \\
R_1 &= r_1 r_2 i_2 + r_3 i_2 \\
R_2 &= i_1 r_2 \\
R_3 &= i_2 + i_1 r_3
\end{aligned}
\qquad (1)
$$

In these equations $i_i$ denotes a primary input and $r_i$ denotes a register variable 
(present state variable). $O_i$ is a primary output function and $R_i$ is a register 
function (next state function). 




Figure 2 a) The original network, b) Retime-extraction 

Consider subexpression $k_r = r_1 r_2 + r_3$, common to $O_1$ and $R_1$. This 
subexpression can be extracted from the expressions for $O_1$ and $R_1$ and used to 
create a new node in the network, $V_{x5}$. Since all the inputs to $k_r$ are register 
variables, this expression is forward retimable. Forward retiming across $V_{x5}$ 
leads to the creation of a new register represented by variables $(R_4, r_4)$. After 
the retiming, the expression for $R_4$ is then given in terms of register input 
variables $R_i$, as illustrated in Fig. 2 b). 

This transformation can be expressed as a new operation, called retime- 
extraction, which is the basis of our logic retiming transformation. For a given 
retimable expression kr, the following steps implement retime-extraction: 

1. For every node $f_i$ of the network containing expression $k_r$, substitute the 
expression with a variable $r_k$. 

2. Introduce a new node corresponding to $k_r$ expressed in terms of register 
input variables, $R_i$. Represent it by register function $R_k$. 

3. Introduce a new register $(R_k, r_k)$. 

It should be emphasized that, whenever the register variables in the support 
of retimable expression $k_r$ fan out to other functions, the retime-extract 
operation involves implicit retiming across fanout stems. In our example this 
applies to registers $R_2$, $R_3$, which have multiple fanouts. Consequently, a set of 
equivalence relations will be imposed on these registers and used in the 
subsequent logic simplification. On the other hand, if a register involved in the 
retime-extraction fans out only to the retimable expression, it will be rendered 
redundant by the transformation and subsequently removed. In our example, 
$R_1$ fans out only to the retime-extracted expression, and can be removed along 
with the associated logic function (Fig. 4). 



3.2 Collapsing and Simplification 

The next step is to collapse the node represented by a new variable $R_k$ into 
its fanin nodes, as shown in Fig. 3. The resulting expression is then simplified. 
Notice the implicit duplication of logic, necessary to perform the collapsing 
and simplification. This ensures that the functionality of the rest of the 
network remains unchanged. In our case, logic for $R_1, R_2, R_3$ is duplicated (see 
the area marked by the dotted line). The simplification is possible, in effect, 
due to register equivalence imposed on fanout registers. For simplicity, in all 
the figures, we use the same variable name for each of the registers obtained 
after retiming across a fanout. 

In our case the collapsing and simplification lead to the following expression: 

$$R_4 = R_1 R_2 + R_3 = (r_4 i_2)(i_1 r_2) + (i_2 + i_1 r_3) = i_2 + i_1 r_3 \qquad (2)$$
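The simplification in Eq. (2) can be checked by exhaustive enumeration. The following is a minimal sketch, not part of the original implementation; the function names are illustrative, and the register functions are taken from Eq. (1) with $R_1$ rewritten as $r_4 i_2$ after the substitution of $r_4$ for $k_r$.

```python
from itertools import product

def r4_collapsed(i1, i2, r2, r3, r4):
    """R4 = R1*R2 + R3 with R1 = r4*i2, R2 = i1*r2, R3 = i2 + i1*r3."""
    R1 = r4 and i2
    R2 = i1 and r2
    R3 = i2 or (i1 and r3)
    return bool((R1 and R2) or R3)

def r4_simplified(i1, i2, r2, r3, r4):
    """Right-hand side of Eq. (2): i2 + i1*r3."""
    return bool(i2 or (i1 and r3))

# Compare both forms on all 2^5 assignments of (i1, i2, r2, r3, r4).
equivalent = all(
    r4_collapsed(*v) == r4_simplified(*v)
    for v in product([False, True], repeat=5)
)
print(equivalent)
```

The product term $(r_4 i_2)(i_1 r_2)$ contains the literal $i_2$, so it is absorbed by the $i_2$ term of $R_3$.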

The simplified Boolean expression for $R_k$ is also referred to as a 
retime-expression $RE(k_r)$. It can be calculated for every retimable kernel or cube $k_r$ 
using the above procedure. The computation of $RE(k_r)$ is central to the logic 
retiming transformation. In our example, the expressions associated with $V_{x5}$ 
and $V_{x4}$ are the same, so the redundant logic can be removed, as shown in 
Fig. 4(a). Finally, notice that register function $R_1$ is not used. This is because the 
register disappeared as a result of retime extraction across $r_1 r_2 + r_3$. Therefore, 
the combinational logic function associated with the register function can be 
deleted. The resulting network is shown in Fig. 4(b). Furthermore, since the 
register functions $R_3$, $R_4$ are identical, the two registers could be merged into 
one, provided that their initial states are identical, i.e., $r_3^0 = r_4^0$. Whether this 
is possible or not depends on the initial conditions imposed on the network 
(see the next section on initial state computation). 

Figure 3 Collapsing of $R_4$ into its fanin nodes (node labels: $x_3 = i_1 r_2$, 
$x_4 = i_2 + i_1 r_3$, $x_5 = R_1 R_2 + R_3$) 




Figure 4 a) Network after simplification b) Final network with redundant 
logic removed 

This network is the direct result of our logic retiming transformation. The 
retime-extraction, collapsing and simplification transformations are performed 
implicitly through the computation of the retime-expression. 



3.3 Initial State Computation 

The initial state computation upon forward retiming across an arbitrary logic 
expression, as formally given in [12], is straightforward. Let $r_i^0$ be the initial 
value of a register $(R_i, r_i)$. For a retimable expression $k_r(r_1, r_2, \ldots, r_n)$, the 
initial value of the register $(R_k, r_k)$, added by the retime-extraction, is given 
by $r_k^0 = k_r(r_1^0, r_2^0, \ldots, r_n^0)$. For the example above, with retimable expression 
$k_r = r_1 r_2 + r_3$, the initial value of register $(R_4, r_4)$ is then given by 
$r_4^0 = r_1^0 r_2^0 + r_3^0$. 
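As a small illustration of this rule, the sketch below evaluates the retimable expression of the example on the initial register values. The function and dictionary names are hypothetical, chosen only for this example.

```python
def k_r(r1, r2, r3):
    """Retimable expression of the example: k_r = r1*r2 + r3."""
    return int(bool((r1 and r2) or r3))

def initial_value_of_new_register(initial):
    """Initial value of (R4, r4): k_r evaluated on the initial register values."""
    return k_r(initial["r1"], initial["r2"], initial["r3"])

init_a = {"r1": 1, "r2": 1, "r3": 0}
init_b = {"r1": 1, "r2": 0, "r3": 0}
print(initial_value_of_new_register(init_a))  # 1
print(initial_value_of_new_register(init_b))  # 0

# Registers r3 and r4 can be merged only if their initial values agree.
can_merge_a = initial_value_of_new_register(init_a) == init_a["r3"]
can_merge_b = initial_value_of_new_register(init_b) == init_b["r3"]
```

With `init_a` the new register starts at 1 while $r_3$ starts at 0, so the two registers could not be merged; with `init_b` both start at 0.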



3.4 Cost Modeling 

Logic retiming is characterized by several important properties, which can be 
illustrated conceptually in Fig. 5. First, it can be shown that logic retiming 
does not degrade the overall cycle-time under the unit-delay model. Since 
the retime-expression node $RE(k_r)$ is obtained by means of collapsing and 
simplification, it will always be appended to the network at the same (last) 
level as the nodes that are collapsed into it. By definition, the arrival time at 
the output of this node will be no greater than the latest arrival time in the 
rest of the network. Hence adding a retime-expression node will not increase 
the topological longest path under this model. 

Realistically, since retime-extraction may increase fanout on some of the 
nodes (for example node $i_1$ in Fig. 4), the critical path delay could actually 
increase. This may happen, for example, when a node on a critical path fans 
out to the newly created node, $RE(k_r)$ (see node $V_1$ in the figure). This 
problem can be identified by considering an augmented delay model which 
takes the fanout factor into consideration. 





Figure 5 Conceptual view of logic retiming 



Finally, observe that the complexity of a node (measured e.g. in the number 
of literals) that is affected by retime-extraction will always be reduced by the 
extraction of the retimable expression (see node V in the figure). Since it can 
be argued that the complexity of a node reflects to a certain degree its delay, 
the delay of the critical path will be reduced, provided that retime-extraction 
targets the expressions along that path (for example, if $V_2$ is on a critical 
path) and that the fanout increase does not offset the gain due to extraction. 

The key element to the efficiency of logic retiming is accurate estimation 
of the cost associated with a given retimable expression. Fig. 6 illustrates 
the idea of cost estimation based on simple literal count. It is important to 
note that the two candidate nodes, $k_r$ and $RE(k_r)$, are not yet part of the 
network. Two gains are computed: $\Delta_x$ for standard extraction, and $\Delta_r$ for 
retime-extraction. 

$$
\begin{aligned}
\Delta_x &= \max(lit\_count(V_1),\ lit\_count(V_2),\ lit\_count(V_3)) \\
\Delta_r &= lit\_count(RE(k_r))
\end{aligned}
\qquad (3)
$$

The literal counts of nodes $V_1, V_2, V_3$ are computed before extraction or 
retime-extraction; these include the literals of $k_r$. Retime-extraction (which 
results in the addition of node $RE(k_r)$) is performed if $\Delta_r < \Delta_x$. 
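The decision rule of Eq. (3) can be sketched on the network of Example 1. This is an illustration only: the list-of-cubes encoding of an SOP, the helper `lit_count`, and the particular SOP forms assumed for $O_1$ and $R_1$ (consistent with $k_r = r_1 r_2 + r_3$ being common to both) are assumptions, not the SIS data structures.

```python
def lit_count(sop):
    """Number of literals in a sum-of-products given as a list of cubes."""
    return sum(len(cube) for cube in sop)

# Nodes that contain k_r = r1*r2 + r3, before any extraction (assumed SOP forms).
affected_nodes = {
    "O1": [["r1", "r2", "i1"], ["r3", "i1"], ["i2"]],
    "R1": [["r1", "r2", "i2"], ["r3", "i2"]],
}
# Retime-expression RE(k_r) = i2 + i1*r3, from Eq. (2).
retime_expression = [["i2"], ["i1", "r3"]]

delta_x = max(lit_count(sop) for sop in affected_nodes.values())
delta_r = lit_count(retime_expression)
use_retime_extraction = delta_r < delta_x
print(delta_x, delta_r, use_retime_extraction)  # 6 3 True
```

Here the largest affected node has 6 literals and the retime-expression has 3, so the rule selects retime-extraction.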




Figure 6 Delay gain estimation based on literal count 

Also note that while this approach emphasizes the delay, it can also be used 
to target the logic area (approximated by the total number of literals). The 
gain in area can be computed by comparing $\Delta_r$ with $\Delta_y = lit\_count(k_r)$. 
Depending on the depth of collapsing and the amount of logic simplification, 
$\Delta_r$ may be greater or smaller than $\Delta_y$. 

The initial experiments have shown that even this simplistic gain metric 
can result in cycle-time reduction. A more accurate approach, currently being 
considered, involves a fast node decomposition using Time-Driven Cofactoring 
(TDC) [4]. 



3.5 Logic Retiming Algorithm 

Logic retiming is an iterative operation comprised of the following steps: 

1. Select a set of candidate subexpressions to be extracted. 

2. For each candidate subexpression, check if it is retimable. 

3. If retimable, estimate the delay gain of retime-extraction ($\Delta_r$) and 
regular extraction ($\Delta_x$). It should be emphasized that the gain $\Delta_r$ for the 
retime-expression $k_r$ is based on all the transformations involved: 
retime-extraction, collapsing and simplification. 

4. If retime-extraction gives better gain, perform retime-extraction. 
Otherwise, perform regular extraction. 
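The control flow of these steps can be sketched as follows. This is a hypothetical skeleton rather than the SIS implementation: candidates are plain dictionaries with precomputed estimates, and the treatment of non-retimable candidates (falling back to regular extraction) is an assumption, since the steps above do not state it explicitly.

```python
def is_retimable(support, register_vars):
    """An expression is retimable if every support variable is a register variable."""
    return set(support) <= set(register_vars)

def choose_transformations(candidates, register_vars):
    """Map each candidate name to the transformation selected by steps 2-4."""
    plan = {}
    for cand in candidates:
        retimable = is_retimable(cand["support"], register_vars)
        # Step 4: retime-extract only when its estimate beats regular extraction.
        if retimable and cand["delta_r"] < cand["delta_x"]:
            plan[cand["name"]] = "retime-extraction"
        else:
            plan[cand["name"]] = "extraction"
    return plan

registers = ["r1", "r2", "r3"]
candidates = [
    {"name": "k1", "support": ["r1", "r2", "r3"], "delta_x": 6, "delta_r": 3},
    {"name": "k2", "support": ["i1", "r2"], "delta_x": 4, "delta_r": None},
    {"name": "k3", "support": ["r2", "r3"], "delta_x": 3, "delta_r": 5},
]
plan = choose_transformations(candidates, registers)
print(plan)
```

Candidate `k2` depends on a primary input and is therefore not retimable; `k3` is retimable but its retime-expression estimate is worse than regular extraction.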



3.6 Comparison with Extraction and Gate-level Retiming 



The following example illustrates that logic retiming can lead to circuit opti- 
mization (both in terms of delay and area) that is not possible with conven- 
tional multi-level synthesis or gate-level retiming alone, see Fig. 7. 

Example 2: Consider again the logic network of Example 1, Eq. (1). Recall that 
retime-extraction of expression $k_r = (r_1 r_2 + r_3)$ resulted in a new variable 
$R_4 = R_1 R_2 + R_3 = (r_4 i_2)(i_1 r_2) + (i_2 + i_1 r_3) = i_2 + i_1 r_3$. 



Extraction: 
extract $k_r$, introduce variable $x_5 = r_1 r_2 + r_3$ 

    O_1 = x_5 i_1 + i_2 
    R_1 = x_5 i_2 
    R_2 = i_1 r_2 
    R_3 = i_2 + i_1 r_3 
    x_5 = (r_1 r_2 + r_3) 

Logic Retiming: 
retime-extract across $k_r$ ($R_4 = R_1 R_2 + R_3$), collapse and simplify. 

    O_1 = r_4 i_1 + i_2 
    R_2 = i_1 r_2 
    R_3 = i_2 + i_1 r_3 
    R_4 = i_2 + i_1 r_3 = R_3 

Retiming: 
structural operation, retime across $(r_1 r_2)$ 

    O_1 = (r_5 + r_3) i_1 + i_2 
    R_2 = i_1 r_2 
    R_3 = i_2 + i_1 r_3 
    R_5 = R_1 R_2 

Figure 7 Comparison of logic retiming with extraction and retiming 
(feedback loops $R_i$, $r_i$ are omitted for simplicity). 



4 IMPLEMENTATION AND EXPERIMENTAL RESULTS 



We implemented the logic retiming transformation within the SIS framework. 
The implementation of the logic retiming algorithm involves the generation 
of common subexpressions of the combinational part of the network; these 
common subexpressions are generated using the rectangle intersection algo- 
rithm used in SIS. Only those kernels whose value exceeds the user-defined 
threshold are selected. For each of the selected kernels we compare the regular 
extraction value with the retime-extraction value using the gain estimation 
technique. 

The cost function used in our preliminary experiments is the number of 
literals in the SOP form, as discussed in section 3.4. Although simplistic, this 
cost function allowed us to quickly validate the theory. This version of logic 
retiming yielded delay improvements over the regular extraction transformation. 
Research is now focused on the application of the concept of retime-extraction 
to the transformations used in script.delay and other delay optimization 
techniques such as speed-up. We are also investigating the application of logic 
retiming to area minimization under cycle-time constraints. 

We have tested our technique on a number of sequential circuits from the 
ISCAS’91 benchmark set. The circuits were input as logic networks in blif 
format, and their local functions (nodes) were collapsed into SOP form. Each circuit 
was then optimized using logic retiming and independently synthesized with 
standard SIS multi-level optimization. The circuits were resynthesized and 
mapped into the standard SIS lib2.genlib library. The script used for logic 
retiming is identical to the script with conventional SIS transformations, except 
that the gkx command has been replaced by the “retime kernel extract” (rkx) 
command of logic retiming. The general structure of the scripts used in our 
experiments is given below: 



script.rkx 
    sweep 
    collapse or eliminate <threshold> 
    simplify 
    rkx <options> 
    resub -a 
    sweep 
    simplify 

script.gkx 
    sweep 
    collapse or eliminate <threshold> 
    simplify 
    gkx <options> 
    resub -a 
    sweep 
    simplify 



The results are reported in Table 1, which compares the clock-cycle delay, 
number of registers, and area overhead of the circuits obtained by the two 
flows. The delay was computed using the mapped delay model. Those circuits 
which did not contain any retimable kernels are not shown in the table. 
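The percentage columns of Table 1 can be recomputed from the absolute figures. The sketch below assumes that each percentage is the change of the rkx result relative to the gkx result, rounded to the nearest integer; this convention is not stated in the text but reproduces the rows shown here. The helper name `pct_change` is illustrative.

```python
def pct_change(rkx_value, gkx_value):
    """Relative change of rkx with respect to gkx, in percent (rounded)."""
    return round(100.0 * (rkx_value - gkx_value) / gkx_value)

# (Area, Clk, Reg) for the rkx flow and for the gkx flow, copied from Table 1.
rows = {
    "s298": ((167040, 9.59, 25), (145232, 10.95, 14)),
    "s344": ((198592, 13.50, 17), (187456, 17.06, 15)),
    "s382": ((329904, 10.63, 33), (215760, 13.82, 21)),
}

changes = {
    name: [pct_change(a, b) for a, b in zip(rkx, gkx)]
    for name, (rkx, gkx) in rows.items()
}
for name, values in changes.items():
    print(name, values)
```

A negative clock entry means that the rkx flow produced a shorter cycle time than the gkx flow, at the cost of additional area and registers.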



Table 1 gkx vs rkx. Comparison of mapped circuits 

                  rkx                      gkx                % change 
Ckt         Area     Clk   Reg       Area     Clk   Reg    Area   Clk   Reg 
s298      167040    9.59    25     145232   10.95    14      15   -12    79 
s344      198592   13.50    17     187456   17.06    15       6   -21    13 
s444      223648    9.51    24     203232   13.09    21      10   -27    14 
s526      228288   10.10    25     208336   13.64    21      10   -26    19 
s400      266800   11.05    28     211120   12.95    21      26   -15    33 
s9234    1156752   31.90   147    1101536   38.28   135       5   -17     9 
s5378    1316832   25.29   189    1286672   26.31   162       2    -4    17 
s510      245920   24.49     8     223184   28.20     6      10   -13    33 
s15850   3912912  104.1    538    3802480  108.23   504       3    -4     7 
s1488     629648   42.72    13     607840   39.67     6       4     7   117 
s382      329904   10.63    33     215760   13.82    21      53   -23    57 

Even though our initial implementation of logic retiming used a simplistic 
figure of merit based on literal count, most of the circuits synthesized with 
this technique showed a significant reduction in delay. We expect that with a 
better estimation scheme and more selective retime-extraction the 
transformation will give better and consistent improvement in delay. The algorithm 
and library used for mapping would also have a significant impact on the delay 
of the optimized circuit; applying retime-extraction with explicit knowledge 
of the above information could improve the effectiveness of logic retiming. 



Abstract 

This paper presents a novel mapping technique, which uses a Boolean match- 
ing algorithm based on testing techniques. The matching is detected by check- 
ing the controllability and observability of signals in the cell structure against 
the subject functions of the network. The method was implemented inside 
the Sis (Touati 1990) synthesis environment. The comparison with the Sis 
structural mapping shows that the Boolean mapping achieves better results 
in similar or smaller computing time. 

Keywords 

Technology mapping, Boolean matching, Observability, Covering 



1 INTRODUCTION 

Technology mapping is the final and strategic step of logic synthesis. In 
multilevel implementations, the mapping starts with a logic-optimized 
network, modeled as a directed acyclic graph (Dag), where each node represents a 
logic function. The mapping step translates this logic system into a network 
of logic gates (the netlist), while minimizing the cost: silicon area, circuit 
delay and/or power dissipation. It depends on two interrelated processes: gate 
matching and network covering. The gate matching checks which gates of the 
library may implement each node function. The covering selects a list of gates 
to implement the whole system with a minimum cost. 

Structural matching and Boolean matching are the two major approaches 
used to check a library cell. Structural matching verifies the equivalence 
between the cell and the node function structures. Both cell and node functions 
must be cast to a common representation, usually a simple netlist of 2-input 
gates (Nor2 or And2). The cell matching then reduces to checking for graph 
isomorphism. But one library cell may have several equivalent decompositions, 
and the matching has to consider all possible alternatives. To simplify the 
check, library cells are generally represented by trees, which excludes cells 
with internal fanout. Moreover, structural representation considers only 
completely specified functions. 

A solution is to use Boolean matching, based on logic equivalence between 
Boolean functions. If $f(x_1, \ldots, x_n)$ is the function of a node in the Boolean 
network, and $g(y_1, \ldots, y_n)$ the function of a cell in the library, Boolean matching 
reduces to finding an input variable permutation $\pi$ and an input phase assignment 
$\phi$ of $x$ for which cell $g$ produces function $f$ or its complement: 
$f(x) = g(\pi\phi(x))$ or $\bar{f}(x) = g(\pi\phi(x))$. 

In practice, the function $f$ may be incompletely specified, but the function $g$ 
(the library cell) is always completely specified. The total number of variants 
to be considered is $n! \cdot 2^n \cdot 2$. This number quickly becomes intractable for 
exhaustive checking. 
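To give a sense of this growth, the count can be evaluated for a few cell sizes. The function name below is illustrative: $n!$ input permutations, $2^n$ input phase assignments and 2 output phases.

```python
from math import factorial

def num_variants(n):
    """Input permutations x input phases x output phases for an n-input cell."""
    return factorial(n) * 2**n * 2

for n in (2, 4, 6, 8):
    print(n, num_variants(n))
```

For $n = 2$ there are 16 variants, for $n = 4$ there are 768, and for $n = 6$ already 92160.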

Several heuristic approaches have been proposed for the covering: rule-based 
methods, algorithmic methods and stochastic methods. Rule-based systems 
like (Barringer et al 1984) reduce the complexity by local transformations, 
in order to cover the network. But the method is too library dependent and 
allows only local optimizations. The algorithmic methods (Keutzer 1987, Savoj 
1992, Mailhot et al 1993) usually produce better results because they rely 
on global optimizations. Stochastic methods are global techniques based on 
space navigation, such as genetic algorithms. The covering techniques adopted in 
Sis and in our approach are algorithmic. 

In this paper we present a Boolean mapping algorithm based on the Boolean 
matching proposed by (Trullemans et al 1996a). This matching derives from 
testing techniques that analyze a structural equivalent of the cell in terms of 
observability and controllability functions, and compare it to a binary decision 
diagram (BDD) representation of the node functions. An improvement of 
this matching technique is implemented in the Sis environment. We compare 
this new matching technique to the Sis approach, and show that it may deal 
efficiently with tree representations as well as non-tree cells such as Mux and 
Xor. 



2 DEFINITIONS 

Let $x_1, x_2, \ldots, x_n$ be the variables of the space $B^n$, where $B = \{0,1\}$. We 
use $x$ to represent a vertex or a vector of variables in $B^n$. Let $f : B^n \to B$ 
be a Boolean function, and $x_i$ an input variable of $f$. The cofactor of 
$f(x_1, \ldots, x_i, \ldots, x_n)$ with respect to $x_i = 1$ is $f_{x_i=1} = f(x_1, \ldots, 1, \ldots, x_n)$. 
The cofactor of $f(x_1, \ldots, x_i, \ldots, x_n)$ with respect to $x_i = 0$ is 
$f_{x_i=0} = f(x_1, \ldots, 0, \ldots, x_n)$. The Boolean difference of a function $f$ with respect to 
an input variable $x_i$ is the function $\Delta f_{x_i} = f_{x_i=1} \oplus f_{x_i=0}$, where $\oplus$ is the 
exclusive-or operator. This function is also called the observation function of 
the input variable $x_i$. It indicates the condition for which any change at the 
signal $x_i$ can be observed at the output of $f$. An input $p$ of a logic gate is said 
to have a controlling value if its value prevents other inputs from affecting 
the output of the gate. The controlling value for And-type gates (And, Nand) 
is 0, and the controlling value function is $Control(p) = \bar{p}$. The controlling 
value for Or-type gates (Or, Nor) is 1, and the controlling value function is 
$Control(p) = p$. No controlling value exists for inverters and Xor gates. 
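These definitions can be illustrated on small functions evaluated by enumeration. The sketch below is an illustration only: Boolean functions are modelled as Python callables over 0/1 tuples rather than as BDDs, and the helper names are hypothetical.

```python
from itertools import product

def cofactor(f, i, value):
    """Cofactor of f with input i fixed to value (still takes the full vector)."""
    def g(x):
        y = list(x)
        y[i] = value
        return f(tuple(y))
    return g

def boolean_difference(f, i):
    """Observation function of input i: f|x_i=1 XOR f|x_i=0."""
    f1, f0 = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda x: f1(x) ^ f0(x)

def truth_table(f, n):
    return [f(x) for x in product((0, 1), repeat=n)]

and2 = lambda x: x[0] & x[1]
nand2 = lambda x: 1 - (x[0] & x[1])

# A change on x0 is visible at the output of AND exactly when x1 = 1.
obs_and = truth_table(boolean_difference(and2, 0), 2)
obs_nand = truth_table(boolean_difference(nand2, 0), 2)
print(obs_and)   # [0, 1, 0, 1]
print(obs_nand)  # [0, 1, 0, 1]
```

The And gate and its complement have the same observation function for $x_0$, namely $x_1$. For the And-type gate, $\Delta f_{x_0} \cdot Control(x_1) = x_1 \bar{x}_1 = 0$.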



3 BOOLEAN MATCHING 

A recent and comprehensive survey of Boolean matching methods was pre- 
sented in (Benini et al 1995). For incompletely specified functions, the first 
algorithm for detecting a match under input variable permutation and/or 
input and output phase assignment was proposed by Mailhot (Mailhot et 
al 1993). This algorithm uses compatible graphs, but the size of these graphs 
is exponential in the number of input variables, and their application is thus 
limited to functions with a maximum of four inputs. Savoj (Savoj 1992) uses 
the tautology check, but this doesn’t solve the problem of finding the variable 
assignment. He introduced a class of filters that are valid even for incompletely 
specified functions. 

In (Trullemans et al 1996a) a new Boolean matching method was presented: 
the controlling value matching, which allows only a subset of the input 
variables to be considered, and the permutation tree to be pruned as soon as these 
variables are rejected. Moreover, when a correct input variable permutation, even 
a partial one, is found, the corresponding input phase assignment can be directly 
deduced: a total of $2^n$ possible input phase assignments is saved. But this method may 
fail to recognize valid cells when the cell structure is not a tree. We modify 
the algorithm here to validate it for don’t cares and general Dag structures. 



3.1 Controlling value Boolean matching 

The method is based on the equivalence of observation functions for equivalent 
functions (observability equivalence). These observation functions are applied 
on a structural* equivalent of the proposed cell, and this structure is scanned 
from the external inputs to the output. Inside the structure, every elementary 
gate is checked using the controlling value paradigm (controlling value check). 
Along the scan inside the structure, the observation functions of the internal 
signal lines are to be computed: this is called Observation Function Deduction. 
If there are multiple fanouts in the circuit, no exact deduction is possible, and 
only heuristics (Trullemans et al 1996a, Trullemans et al 1996b) could be 
applied. 

* composed only of And-, Or- and inverter-type gates 

The next theorem asserts the equivalence of two functions from their 
observation functions, and is the basis of the method: 

Theorem 1 Let $f : B^n \to B$ and $g : B^n \to B$ be two Boolean functions. The 
function $g$ is equivalent to the function $f$, or to its complement, if and only 
if $\Delta g_{x_i} = \Delta f_{x_i}$ for any input variable $x_i \in x$. 



Match(function f, cell g, permutation Π = initial permutation) 

(1) { if ( scanline(f,g,Π) == Found ) return match_found; 

(2)   return Match(f,g,Π = next permutation); } 

scanline(function f, cell g, permutation Π) 

(4) { For each input x_i               /* input observation functions */ 

(5)     Δg_Π(x_i) = Δf_x_i; 

(6)   For each gate k_i in the cell g  /* scan line the cell g */ 

(7)   { Let p,q the gate inputs and let r the gate output; 

        /* Controlling value check for p */ 

(8)     If ( fanout(p) > 1 ) Δg_p = Multi_fanout_observability(p); 

(9)     If ( Δg_q · Control(p) ≠ 0 ) 

(10)      If ( p is an input with unknown phase ) 

(11)        If ( Δg_q · Control(p̄) ≠ 0 ) return Wrong_permutation; 

(12)        Else phase_invert(p); 

(13)      Else return Wrong_permutation; 

(14)    Else 

(15)    /* Controlling value check for q */ ... 

(16)    Δg_r = Δg_p + Δg_q }           /* Deduction */ 

(17)  return BDD_check(f,g,Π); } 



Figure 1 Algorithm for controlling value matching 

The match algorithm (lines 1-2) is shown in figure 1. The initial 
permutation is derived from the BDD order. The permutation $\Pi$ assigns the input 
variables $x$ of $f$ to the input variables $y$ of $g$. For each permutation, the 
scanline algorithm is called. The library cell function $g$, represented by the circuit 
$C$, is proposed as an implementation of the function $f$. To locate the design 
errors within the circuit $C$, its primary inputs are initialized with the 
observation functions derived from $f$ (lines 4-5). The only acceptable errors here 
are missing inverters (wrong phase assignments) or wire exchanges at primary 
inputs (input variable permutations). For any other detected error, the library 
cell $g$ is not a match for the given function $f$ and the partial permutation is 
pruned. 

The structure of the cell g is analyzed from the primary inputs to the 
primary output (lines 6-16), following a ’scan line’ (figure 2a). The analy- 
sis consists of two basic phases: controlling value analysis (lines 8-15) and 
observation function deduction (line 16). 

Figure 2 (a) Scan Line (b) Function f (c) Rejected permutation 
(d) Accepted permutation (e) Correct phase 

(a) Controlling value analysis 

The controlling value analysis is used to check the correctness of each logic 
gate in this netlist. Let us consider a logic gate $k_i$ in the cell $g$ with two inputs 
$p, q$ and one output $r$ (figure 2a). We assume that the gate $k_i$ is an And-type 
gate or an Or-type gate, and that the observation functions of the gate inputs 
$p$ and $q$ with respect to the required $f$ are known. If the gate input $p$ is a 
primary input, its observation function is initialized with the one computed 
directly from $f$; otherwise it has to be deduced from the observation function 
of its fan-ins (see Section b). 

The controlling value check is computed by $\Delta g_q \cdot Control(p)$ (scanline 
algorithm, line 9). The observation function $\Delta g_q$ of a signal line $q$ indicates the 
condition for which any change at the signal line can be observed at the circuit 
output. The controlling value function $Control(p)$ indicates when $p$ prevents the 
other input $q$ from affecting the gate output and the circuit output. Thus the check 
must be satisfied (= 0) if $p$ and $q$ are the correct inputs of gate $k_i$, and the input 
$p$ has a correct phase. If the controlling value check is false, and $p$ is not a primary 
input, then the partial permutation is not correct (line 13). If $p$ is a primary 
input, the negative controlling value check is computed by $\Delta g_q \cdot Control(\bar{p})$ 
(lines 10-13). If this check is true (= 0), we need to insert an inverter, and the 
input phase is found. Otherwise, the current input variable permutation is 
not correct. 

Example 1 Let us consider the function $f$ and library cell $g$ of figure 2b and 
2c. The observation functions of $f$ are: 

$$\Delta f_{x_1} = x_2 \bar{x}_3, \quad \Delta f_{x_2} = \bar{x}_1 \bar{x}_3, \quad \Delta f_{x_3} = x_1 + \bar{x}_2$$

Let us first check the partial permutation $x_2 \to y_2$ and $x_3 \to y_1$ (figure 2c). 
At gate 1, $\Delta g_{x_3} \cdot Control(x_2) = \Delta g_{x_3} \bar{x}_2 \neq 0$ and 
$\Delta g_{x_3} \cdot Control(\bar{x}_2) \neq 0$: 
the input $x_3$ could be observed for any logic value of the input $x_2$. This is not 
possible because gate 1 is an AND gate, and this permutation must be rejected. 
Note that the input $y_3$ of cell $g$ has not yet been assigned: we consider only a 
subset of the input variables. 

Now, we will try the partial input variable permutation $x_1 \to y_1$ and $x_2 \to y_2$ 
(figure 2d), which gives $\Delta g_{x_2} \cdot Control(\bar{x}_1) = \Delta g_{x_2} x_1 = 0$. The negative 
controlling value check for $x_1$ is validated: an inverter is to be added at input 
$x_1$. For $x_2$, $\Delta g_{x_1} \cdot Control(x_2) = 0$; the controlling value is validated, and the 
phase is correct. The phase assignment (figure 2e) is thus directly determined 
from this analysis: $\phi(y_1, y_2) = (1, 0)$. 
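The checks of this example can be reproduced by enumeration. The sketch below rests on an assumption: the function of figure 2b is not given in the text, so $f = \bar{x}_1 x_2 + x_3$ is assumed here because it is consistent with the phase assignment $(1, 0)$ and with an And gate at gate 1. The helper names are hypothetical, and functions are enumerated directly rather than stored as BDDs.

```python
from itertools import product

def f(x1, x2, x3):
    """Assumed node function: (NOT x1 AND x2) OR x3."""
    return ((1 - x1) & x2) | x3

def observation(func, index):
    """Boolean difference of func with respect to the input at position index."""
    def delta(*x):
        hi, lo = list(x), list(x)
        hi[index], lo[index] = 1, 0
        return func(*hi) ^ func(*lo)
    return delta

def and_gate_check(delta_other, p_index, negative=False):
    """True if delta_other * Control(p) is identically 0 for an And-type gate.

    Control(p) = NOT p; the negative check uses Control(NOT p) = p.
    """
    def term(x):
        p = x[p_index]
        control = p if negative else 1 - p
        return delta_other(*x) & control
    return all(term(x) == 0 for x in product((0, 1), repeat=3))

d1, d2, d3 = (observation(f, k) for k in range(3))

# Rejected permutation: x2 and x3 at gate 1; both checks on x2 fail.
rejected = (and_gate_check(d3, 1), and_gate_check(d3, 1, negative=True))
# Accepted permutation: x1 and x2 at gate 1.
x1_checks = (and_gate_check(d2, 0), and_gate_check(d2, 0, negative=True))
x2_check = and_gate_check(d1, 1)
print(rejected, x1_checks, x2_check)
```

Under this assumption, the first permutation fails both checks, the negative check succeeds for $x_1$ (an inverter is needed), and the direct check succeeds for $x_2$.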

(b) Observation Function Deduction 

To continue the scanning along the structure, we have to determine the 
observation functions of the internal signal lines in the netlist equivalent of the 
cell. These will be deduced from the initial observation functions set at the 
primary inputs (scanline algorithm, line 16). 

Let us consider a tree circuit, and some particular gates $k_i$ and $k_j$ (figure 2a). 
If $p$ and $q$ are the correct inputs of gate $k_i$, we want to deduce the observation 
function of its output $r$. We propose to approximate the observation function 
by $\Delta f_r \approx \Delta g_r = \Delta g_p + \Delta g_q$. 

The next theorem shows that the controlling value check of the signal line 
$r$ may be conserved with this approximation, and gives the exact solution for 
tree circuits: 

Theorem 2 In a tree circuit, consider a gate $k_i$, with inputs $p$ and $q$, and 
output $r$. If $r$ and $s$ are inputs of another gate $k_j$, whose output is $t$, we have 

$\Delta g_r \cdot Control(s) = 0$ iff $\Delta f_r \cdot Control(s) = 0$ 

(c) Matching with don't cares on tree structures 

Consider now a function $f$ which is incompletely specified. We have to define 
the incompletely specified observation functions. 

Definition 1 Let 
$\Delta f^{on}_{x_i} = (f_{x_i=1} \oplus f_{x_i=0})\,(\overline{f_{DC}}_{x_i=1}\,\overline{f_{DC}}_{x_i=0})$ 
and $\Delta f^{dc}_{x_i} = f_{DC\,x_i=1} + f_{DC\,x_i=0}$ be respectively the on-set and the 
don't care set of the observation functions with respect to an input variable $x_i$ 
of an incompletely specified function $f$, with don't care set $f_{DC}$. 

Only the on-sets of these observation functions are important. The matching 
algorithm will take into account don't cares when the observation functions of 
$g$ are initialized by $\Delta f^{on}_{x_i}$ instead of $\Delta f_{x_i}$. For tree structures, if the controlling 
value check is false, there cannot be a match. Doing so, we will perhaps accept 
bad cells, rejected later by the BDD check, but never reject acceptable 
solutions. 

Figure 3 Reconvergence (a) after Q2 (b) before Q2 (c) Multiple fanout 



(d) Analysis of a Multiple Internal fanout Cell 

If there are multiple internal fanouts in the library cell, the observation 
function deduction is more complex. If we know the observation functions for 
all the fanout branches, we can deduce the observation function of a fanout 
stem (De Micheli 1994). The deduction is computed by a simple network 
traversal, from the output to the inputs. In the network of figures 3a and 3b, 
the observation function of the signal line $h$ can be computed as: 

$$\Delta g_h = \Delta g_{h_2}|_{h_1=h} \oplus \Delta g_{h_1} \qquad (1)$$

In the matching process, we want to deduce the observation function of 
the fanout branches from that of the fanout stem, i.e. from the inputs to the 
outputs. Normally, no exact deduction is possible, and only heuristics could be 
applied. We will propose a new one, which can be proven to be an approximation 
of the exact expression. Starting from equation 1, we can write: 

$$
\begin{aligned}
\Delta g_{h_1} &= \Delta g_h \oplus \Delta g_{h_2}|_{h_1=h} \\
&= \Delta g_h \oplus (\Delta s_{h_2} \cdot \Delta g_s)|_{h_1=h} \\
&= \Delta g_h \, \overline{\Delta s_{h_2}}|_{h_1=h} + \Delta g_h \, \overline{\Delta g_s}|_{h_1=h} + \overline{\Delta g_h} \, \Delta s_{h_2}|_{h_1=h} \, \Delta g_s|_{h_1=h}
\end{aligned}
$$

In this expression, $\Delta g_h$ and $\Delta s_{h_2}$ are known, but not $\Delta g_s$. We propose to 
take only the first term as an approximation of $\Delta g_{h_1}$: 

$$\Delta g_{h_1} \approx \Delta g_h \, \overline{\Delta s_{h_2}}|_{h_1=h}$$

This will always give results included in the complete solution, even if there 
are redundancies in the net. If there are more than two fanout branches, the 
approximation may be generalized for $n$ fanout branches as: 

[Two displayed expressions for $\Delta g_{h_n}$, one for even $n$ and one for odd $n$, each 
ranging over $i = 1, \ldots, n-1$ with $h_{i+1} = \ldots = h_n = h$; the formulas are illegible 
in this copy.] 

Each term $\Delta g_{h_i}$ may be replaced by $(\Delta y_{i\,h_i} \cdot \Delta g_{y_i})$. Using the definition of 
the Xor function, the two expressions may be developed. Only the first term 
is kept, because the others depend on $\Delta g_{y_i}$ (where $i = 2$ to $n$), which are not 
known: 

$$\Delta g_{h_n} \approx \Delta g_h \cdot \prod_{i=1}^{n} \overline{\Delta y_{i\,h_i}}|_{h_{i+1}=\ldots=h_n=h}$$



3.2 Library organization 

Before the technology mapping, a setup phase is used to process gates in 
the library and generate particular data structures called Boolean Primitive 
Classes (BPC). All the gates in a BPC are equivalent to a BPC representative 
in the sense that the function of each gate can be obtained by inverting inputs 
and/or output of the BPC representative. The BPC structure reduces the 
number of calls to the Boolean matching algorithm. 

The BPC structure contains information that improves the performance 
of the Boolean matching. This information concerns the characteristic 
signatures and symmetric variables that are used to reduce the total number of 
possible input variable permutations and the total number of possible input 
phase assignments, which may dramatically speed up the matching process. 
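The grouping of gates into classes can be sketched by enumeration. This is only an illustration of the equivalence relation stated above (inversion of inputs and/or output), not the BPC data structure itself: the canonical representative chosen here, the smallest truth table over all phase assignments, and the helper names are assumptions.

```python
from itertools import product

def truth_table(func, n):
    """Pack the truth table of an n-input 0/1 function into an integer."""
    bits = 0
    for k, x in enumerate(product((0, 1), repeat=n)):
        bits |= func(*x) << k
    return bits

def class_representative(func, n):
    """Smallest truth table reachable by inverting inputs and/or the output."""
    best = None
    for phase in product((0, 1), repeat=n):
        for out_phase in (0, 1):
            def variant(*x):
                return func(*(xi ^ p for xi, p in zip(x, phase))) ^ out_phase
            tt = truth_table(variant, n)
            if best is None or tt < best:
                best = tt
    return best

gates = {
    "and2": lambda a, b: a & b,
    "or2": lambda a, b: a | b,
    "nand2": lambda a, b: 1 - (a & b),
    "nor2": lambda a, b: 1 - (a | b),
    "xor2": lambda a, b: a ^ b,
    "xnor2": lambda a, b: 1 - (a ^ b),
}

classes = {}
for name, func in gates.items():
    classes.setdefault(class_representative(func, 2), []).append(name)
print(classes)
```

The six 2-input gates fall into two classes, one containing And, Or, Nand and Nor, the other Xor and Xnor, so only one representative per class needs to be passed to the matching algorithm.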

4 IMPROVING SIS WITH BOOLEAN MATCHING 

Covering and matching are two strongly interrelated processes. Sis uses a tree 
structural check for matching, which leads to a pattern-based algorithm for 
the covering step. The logic circuit to be mapped is first decomposed into a 
network of 2-input Nand (or Nor) gates and inverters. The covering algorithm 
splits the network into a forest of trees, which are covered by patterns that 
represent the cells in the library. In addition, Sis implements some ad hoc 
techniques that handle the particular case of cells with internal fanout, like 
Xor and Mux. The tree covering adopted by Sis traverses the network from 
the inputs to the outputs and, for each node, all the patterns in the library 
are checked for matching: the time complexity directly depends on the size of 
the library. 

In our approach, the matching algorithm is a Boolean one, which can find 
matches that are not detected by structural methods, and may exploit the 
degrees of freedom provided by don’t care conditions. Moreover, Boolean 
covering and matching can handle the input/output phase of the library cells. 
This reduces the initial subject graph by avoiding the inclusion of inverter 
pairs. The base function becomes an NPN* And function, and no additional 
inverters are needed. Xor, Mux and other cells could also be used as base 
functions, by adapting the decomposition algorithm: Thakur (Thakur et al 1996) 
adds 2-1 multiplexers, which increases the granularity of the base functions. 
His results show a gain in area, or a gain in delay without area increase, 
compared to the usual Huffman based technology decomposition implemented in 
Sis. 



5 BOOLEAN COVERING 

In the Boolean mapping that we use, the nodes are grouped in clusters from 
the outputs to the inputs. A cluster is a connected sub-graph of the subject 
graph, having only one vertex with zero out-degree (a single output). It is 
characterized by its depth (longest path from the root to a leaf) and number of 
inputs. A cluster function is associated with each cluster, and is matched 
against a subset of the gates, which are filtered out of the library through 
signature and symmetry analysis. As a consequence, the time complexity depends 
on the number of clusters generated at each node, and not on the size of the 
library. 

The covering algorithm implemented in our package is based on a clustering 
process enhanced to deal with reconvergent fanout. The algorithm assumes 
that the network has been partitioned into subject graphs and decomposed 
into base functions beforehand. Its pseudo-code is shown in figure 4. 

Cover(node top, set nodes, set inputs) 

(1) if ( size(inputs) > max_input_cluster ) return; /* Prune Cluster Generation */ 

(2) forall(x, inputs) { 

(3)     if ( x is not mapped ) 

(4)         Cover(x, make_set(x), get_inputs(x)); 

(5)     if ( x is internal node And fanouts(x) reconverge at top ) 

(6)         Cover(top, nodes, new_inputs); } 

(7) if ( size(inputs) <= max_input_match ) 

(8)     library_match_boolean_class(top, inputs); 



Figure 4 Algorithm for network Boolean covering 

The parameter top is the root node for the covering process. According 
to the library cells available, it may be interesting to allow the generation of 
clusters with a larger number of inputs. This allows the detection of 
reconvergent fanout cells that have a relatively small number of inputs and a 
larger number of internal signals. The cluster generation process will stop 
(line 1) when the number of cluster inputs reaches the max_input_cluster value. 
This controls the matching space exploration. The recursive algorithm 
computes all clusters for each node (lines 2-6) in the network rooted at top, 
and tries to find the best set of matches that reduces the cost function. For 
each generated cluster, the number of its inputs is used as a filter before 
calling the match algorithm (lines 7-8). The parameter max_input_match 
indicates the maximum number of inputs of a library cell, and controls the size 
of the clusters that may be matched. 

*If there is a permutation operator P and complementation operators N_i, N_o 
such that f(x) = N_o g(P N_i x) is a tautology, then f and g belong to the same 
NPN class. 

The cost function for a match (line 8) is given by the cost of the selected cell 
plus the cost of matching the cluster inputs, which have already been matched. 
In case of area optimization, the cell cost is only the cell surface. For each 
cluster input the best phase is selected. To handle reconvergent fanout (lines 
5-6), if all fanouts of the cluster nodes reconverge at the node top, the new 
cluster with internal fanout will be generated and taken into account. There 
are additional parameters that heuristically control the performance of the 
mapping. For example, the covering may be set to not collapse reconvergent 
fanouts. This is useful when the library contains only tree-like cells because 
it provides faster results. 
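
The cost recursion described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the implementation used in our package: the `matches` dictionary, the node names and the areas are hypothetical, phase selection is omitted, and input costs are simply summed, which is exact only for tree-like networks (a shared input would be counted more than once).

```python
def best_cost(node, matches, cache=None):
    """Minimum area needed to implement `node`.

    matches[node] is a list of (cell_area, cluster_inputs) pairs, one per
    library cell that matched a cluster rooted at `node`. Nodes without an
    entry are treated as primary inputs with zero cost.
    """
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    if node not in matches:
        cache[node] = 0.0
        return 0.0
    # Cost of a match = cost of the selected cell + cost of the cluster inputs.
    cost = min(
        area + sum(best_cost(i, matches, cache) for i in inputs)
        for area, inputs in matches[node]
    )
    cache[node] = cost
    return cost


# Hypothetical example: f = Nand(n1, c) with n1 = Nand(a, b).
MATCHES = {
    "n1": [(2.0, ("a", "b"))],          # a 2-input cell covering n1 alone
    "f": [
        (2.0, ("n1", "c")),             # 2-input cell, n1 mapped separately
        (3.0, ("a", "b", "c")),         # 3-input cell covering f and n1
    ],
}
```

With these hypothetical areas, the depth-2 cluster with three inputs costs 3.0 and is preferred to two separate 2-input cells, which cost 4.0 in total.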

One more advantage of the Boolean covering is the possibility to extend the 
matching candidates by using the network don’t cares. Two main approaches 
for the Boolean covering problem with don’t cares were developed in the 
literature. In Ceres, Mailhot (Mailhot et al 1993) generates all clusters up to 
6 inputs. Don’t care handling in Ceres is restricted to 4 input variables. An 
alternative method was proposed by Savoj (Savoj 1992). It is based on the 
tautology check by BDD computation but does not restrict the cluster size of 
incompletely defined functions. However, the use of the full don’t care set for 
matching may slow down the mapping by two orders of magnitude. We don’t 
address this point here. 



6 RESULTS 

We ran a set of benchmarks from the Mcnc suite to evaluate our Boolean 
mapping approach compared to the Sis structural one, on an Ultra Sparc I 
workstation. The version of Sis, modified to include our Boolean mapping, is 
referred to here as Land. All benchmarks, except C432, cm150a, x2 and ttt2, 
were optimized with the rugged script followed by a call to full_simplify. The 
Boolean matching was processed without don’t cares. 

Table 1 shows a comparison between Land and Sis for the library 
synch.genlib, which is distributed with Sis. It is a standard cell library 
which includes Xor2, 2-1 Mux and tree-like cells up to 8 inputs. Columns G 
(gates) show the cell count of the mapped circuit, and columns T (time) show 
the Cpu time in seconds. Column A_S for Sis presents the total area of the 
mapped circuit, and columns A_L/A_S for Land show the normalized total 
area (ratio Land/Sis). For a depth of 2, Land generates all clusters with that 
logic depth. This means that we may have gates up to 4 inputs. Land reported 
an average area gain of 7% and was on average 3 times faster than Sis. For 
a depth of 3, the maximum number of inputs is 8 (tree-like cluster). In this 
case, Land reported an average gain of 9% in a similar computing time. 

Bench.              Sis                Land, Depth=2          Land, Depth=3 
Name          G     A_S      T       G   A_L/A_S    T       G   A_L/A_S    T 

cm152a       17     528    0.7      15     .92     0.2      16     .95     2.3 
z4ml         31     856    0.9      23     .78     0.3      23     .78     0.6 
x2*          40    1176    1.6      40     .97     0.5      36     .92     1.8 
cm150a*      41    1328    1.3      34     .78     0.8      35     .79     0.8 
sao2         95    2728    3.4      93     .95     1.0      86     .91     7.7 
C432*       155    4336    5.2     121     .85     1.4     111     .83     6.2 
9symml      156    4828    8.4     147     .95     2.0     133     .91    19.4 
C1355       244    7392    6.6     226     .94     2.2     226     .94     2.9 
C880        255    7768    7.8     250     .92     2.6     261     .88    13.0 
C1908       264    8016    7.9     261     .93     2.4     267     .95     4.1 
ttt2*       344   10864   19.6     305     .91     4.7     302     .91      37 
apex6       476   13376   14.7     474     .97     4.3     441     .95    10.8 

Aver.      100%    100%   100%     94%     93%     29%     91%     91%    136% 

*: not optimized. 

Table 1 Library synch.genlib on depth cluster 



7 CONCLUSION 

This paper presents an efficient algorithm for Boolean mapping based on a 
fast Boolean matching approach. The Boolean matching is based on testing 
techniques which prune the search space during the matching step by early 
detection of unsuccessful matches. Applied to the area minimization problem, 
the benchmarks have shown both an area and a performance gain with respect 
to the structural mapping of Sis. 

Interesting improvements will be considered in future work. The main cost 
in the Boolean mapping is related to the relatively large number of clusters 
that may be generated and evaluated. While libraries exhibit gates with a 
large number of inputs, those gates are not frequently used and account for a 
performance degradation. Specific filtering techniques will in that case increase 
the mapping speed. 

Our future research will focus on Boolean mapping with don’t care process- 
ing, mainly for use in low power applications. 


Abstract 

Power dissipation has recently emerged as one of the most critical design constraints. 
Data-dependent power management techniques are among the most effective for 
power reduction. Depending on some input conditions, the clock driving some of the 
registers in the circuit is inhibited, thus reducing the switching activity in the fanout 
of those registers. 

The use of data-dependent power management techniques creates some interesting 
testability problems. The signals used to inhibit the clock can dramatically reduce the 
observability of the nodes in the circuit. In this paper, we first describe an approach 
for the complete testing of power-optimized circuits using these techniques. We then 
present results that show that using state-of-the-art techniques, ATPG for the power 
managed circuit is not significantly more difficult than for the original circuit. 



Keywords 

Low power, power management, test pattern generation, propositional satisfiability 



1 INTRODUCTION 

Two aspects have combined to make power consumption one of the most critical 
design parameters. On one hand, the rapid increase in clock frequencies, chip 
complexity and scale of integration is creating significant heat dissipation 
problems. High operating temperatures can affect the circuit’s reliability and 
reduce the lifetime of the system. In order to dissipate the heat that is 
generated, special packaging and cooling systems may be required, leading to 
higher costs. 

On the other hand, we have the proliferation of portable devices. For personal 
communication applications like hand-held mobile telephones, battery lifetime may 
be the decisive factor in the success of the product. 

Power reduction techniques have been proposed at all design levels, from system 
to device. It has been demonstrated at the gate and system levels that large power 
savings are possible merely by cutting down on wasted power, commonly referred 
to as power management. At the system level, this involves shutting down blocks of 
hardware during periods in which they are not being used. 

Several methods have been presented that perform the shutdown of a section of the 
circuit on a clock-cycle basis. Depending on some input conditions, the clock driving 
some of the registers in the circuit is inhibited, therefore reducing the switching 
activity in the fanout of those registers. These techniques are referred to as 
data-dependent power management techniques. 

The use of data-dependent power management techniques creates some interest- 
ing testability problems. In order to detect the input conditions that allow for the 
disabling of the clock signal, some extra circuitry has to be added to the original cir- 
cuit. Since the functionality of the circuit is not being altered, this logic is redundant, 
thus dramatically reducing the observability of some of the nodes in the circuit. 

In this paper, we first describe an approach for the complete testing of power- 
optimized circuits using data-dependent power management. The approach we pro- 
pose is based on a two input vector sequence. The first input vector guarantees that 
we can always set the output of the latches to the required value, even if the clock is 
disabled in the second input vector. 

Using this testing strategy, we then present results that show that using state-of- 
the-art techniques, automatic test pattern generation for the power managed circuit 
is not significantly more difficult than for the original circuit. 



2 POWER MANAGEMENT TECHNIQUES 

In a synchronous digital system controlled by a global clock, a generally accepted 
model for the power dissipated by a gate is given by: 

$$P_{avg} = 0.5 \times C_{load} \times \frac{V_{dd}^2}{T_{cyc}} \times E(transitions) \qquad (1)$$ 

where $P_{avg}$ denotes the average power, $C_{load}$ is the load capacitance, 
$V_{dd}$ is the supply voltage, $T_{cyc}$ is the global clock period, and 
$E(transitions)$ is the expected number of gate output transitions per global 
clock cycle (Najm 1993). 

It follows directly from Equation 1 that one way to reduce power consumption is 
to reduce the switching activity $E(transitions)$ in a circuit. So-called power 
management techniques that shut down hardware for periods of time in which it is 
not producing useful data are effective methods of reducing the power consumption 
of a circuit. Shutdown can be accomplished by either turning off the power supply 
or by disabling the clock signal. 
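
As a worked example of Equation 1, the following Python sketch evaluates the average power of a single gate and shows that halving the switching activity halves the dissipated power. The numerical values (50 fF load, 3.3 V supply, 10 ns clock period) are hypothetical and chosen only for illustration.

```python
def average_power(c_load, v_dd, t_cyc, expected_transitions):
    """Average gate power from Equation 1, in watts.

    c_load in farads, v_dd in volts, t_cyc in seconds;
    expected_transitions is the expected number of output transitions
    per clock cycle.
    """
    return 0.5 * c_load * v_dd ** 2 / t_cyc * expected_transitions


# Hypothetical gate: 50 fF load, 3.3 V supply, 100 MHz clock.
P_BEFORE = average_power(50e-15, 3.3, 10e-9, 0.4)  # about 10.9 microwatts
P_AFTER = average_power(50e-15, 3.3, 10e-9, 0.2)   # switching activity halved
```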

System-level approaches identify idle periods for entire modules and turn off the 
clock for these modules for the duration of the idle periods (Chandrakasan, Sheng & 
Brodersen 1992, Chapter 10). Detection and shutdown of unused hardware is done 
automatically in current generations of Pentium and PowerPC processors. The trend 
is that future generations of processors will provide software controls for selective 
hardware shutdown, a feature already available in the Fujitsu SPARClite processor. 
Scheduling algorithms that maximize the shutdown period of execution units in 
a system have been presented (Monteiro, Devadas, Ashar & Mauskar 1996). Given 
throughput constraints and the execution units available, operations are scheduled 
such that those that generate controlling signals are computed first, thus indicating 
the flow of data through the circuit. Power is saved by only activating hardware 
modules involved in computing the final result, all other modules being shut down. 

At the logic level, some shutdown techniques, such as precomputation (Alidina, 
Monteiro, Devadas, Ghosh & Papaefthymiou 1994), guarded evaluation (Tiwari, 
Ashar & Malik 1995) and gated-clock finite state machines (Benini, Siegel & Micheli 
1996), have been proposed recently and are among the most efficient power 
optimization techniques at this design level. These techniques are named 
data-dependent power management techniques because power management is achieved 
on a clock-cycle basis, as a function of the input conditions at the beginning of 
each clock cycle. 

The precomputation method (Alidina et al. 1994) adds a simple combinational cir- 
cuit (the precomputation logic) to the original circuit. Under certain input conditions, 
the precomputation logic disables the loading of all or a subset of the input registers. 
Under these input conditions, no power is dissipated in the portions of the original 
circuit with only disabled registers as inputs. We will analyse this technique in more 
detail in the next section. 

Guarded evaluation (Tiwari et al. 1995) identifies cones internal to the circuit that 
can be shut down under certain input conditions. In the process, it creates new tran- 
sition barriers (guards) in the form of additional latches or OR/AND gates. Instead 
of adding the precomputation logic to generate the clock disabling signal, guarded- 
evaluation uses signals already existing in the circuit. 

The gated-clock finite state machine (FSM) approach (Benini et al. 1996) is 
based on identifying self-loops in a Moore FSM*. If the FSM enters a state with a 
self-loop, the clock is turned off. In this situation, the inputs to the combinational 
logic block do not switch, and thus we have virtually zero power dissipation in that 
block. When the input values cause the FSM to make a state transition, the clock 
signal is again allowed to function normally. Techniques to transform locally a Mealy 
machine into a Moore machine are presented so that the opportunity for gating the 
clock is increased. 

All these techniques achieve power reduction by stopping transitions at the inputs 
from propagating to combinational logic blocks. However, this has the undesirable 
consequence of making the testing of the circuit more difficult. Since for some input 
combinations the loading of the registers is disabled, it is not possible to have all 
combinations at the inputs of the combinational logic blocks. 

We present a method for the automatic test pattern generation of data-dependent 
power managed circuits in Section 4. Although the problem is similar for the three 
techniques presented above, we will focus on the precomputation technique in the 
remainder of this paper. For this reason, in the next section we analyze this technique 
in more detail. 



* In a Moore FSM, the outputs are completely defined by the present state, whereas in a Mealy FSM the 
outputs depend both on the present state and primary input lines. 

Figure 1 Circuit before precomputation. 

3 PRECOMPUTATION FOR LOW POWER 

Consider the circuit of Figure 1, where the combinational logic block A is bounded 
by registers $R_1$ and $R_2$. While $R_1$ and $R_2$ are shown as distinct registers 
in Figure 1, they could, in fact, be the same register (as in an FSM). 

The precomputation architecture is shown in Figure 2. The inputs to the block A 
have been partitioned into two sets, corresponding to the registers $R_1$ and $R_2$. 
The output of A feeds the register $R_3$. The Boolean functions $g_1$ and $g_2$ are 
the predictor functions and have the same inputs as register $R_1$. $g_1$ and $g_2$ 
are defined so that: 



$$g_1 = 1 \Rightarrow f = 1 \qquad (2)$$ 

$$g_2 = 1 \Rightarrow f = 0 \qquad (3)$$ 



Therefore, during clock cycle $t$, if either $g_1$ or $g_2$ evaluates to 1, the load 
enable signal of the register $R_2$ is set to 0. This implies that the outputs of 
$R_2$ during clock cycle $t+1$ do not change. However, since the outputs of register 
$R_1$ are updated, the function $f$ will evaluate to the correct logical value. A 
power reduction is achieved because only a subset of the inputs to block A change, 
implying reduced switching activity. 
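
Conditions (2) and (3) can be checked exhaustively on a small example. The Python sketch below uses an n-bit comparator f = (C > D), a function chosen here only for illustration, with predictor functions that look at the most significant bits alone. The function names and the bit width N are assumptions of the sketch.

```python
from itertools import product

N = 3  # hypothetical operand width in bits


def f(c, d):
    """Function implemented by block A: 1 if C > D."""
    return int(c > d)


def msb(x):
    return (x >> (N - 1)) & 1


def g1(c, d):
    """Predicts f = 1 from the most significant bits only."""
    return int(msb(c) == 1 and msb(d) == 0)


def g2(c, d):
    """Predicts f = 0 from the most significant bits only."""
    return int(msb(c) == 0 and msb(d) == 1)


def load_enable(c, d):
    """NOR of the predictor outputs: 0 disables the loading of R2."""
    return int(not (g1(c, d) or g2(c, d)))


def check_predictors():
    """Verify conditions (2) and (3); return the fraction of input
    combinations for which the loading of R2 is disabled."""
    disabled = 0
    for c, d in product(range(1 << N), repeat=2):
        if g1(c, d):
            assert f(c, d) == 1
        if g2(c, d):
            assert f(c, d) == 0
        disabled += 1 - load_enable(c, d)
    return disabled / (1 << (2 * N))


DISABLED_FRACTION = check_predictors()  # 0.5 for uniformly distributed inputs
```

For uniformly distributed inputs, the most significant bits differ in half of the cases, so the register holding the remaining bits is not loaded in half of the clock cycles.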

Note that $g_1$ and $g_2$ add to the delay of paths that originally ended at $R_1$ 
but now pass through $g_1$ or $g_2$ and the NOR gate before ending at the load 
enable signal of the register $R_2$. Therefore, caution should be used so that this 
transformation is applied on non-critical signals or logic blocks. 

The choice of $g_1$ and $g_2$ is critical. On one hand, we should include as many 
input conditions as possible in $g_1$ and $g_2$, i.e., we want to maximize the 
probability of $g_1$ or $g_2$ evaluating to 1. On the other hand, $g_1$ and $g_2$ 
correspond to extra logic that is added to the circuit, consequently increasing 
power consumption. To obtain reduction in power with marginal increases in circuit 
area and delay, $g_1$ and $g_2$ have to be significantly less complex than $f$. The 
precomputation architecture of Figure 2 ensures this by making $g_1$ and $g_2$ 
depend on significantly fewer inputs than $f$. Methods to automatically determine 
the precomputation logic of a circuit are described in (Alidina et al. 1994). 

Using these same principles, different precomputation architectures have been 
proposed in (Alidina et al. 1994, Monteiro, Rinderknecht, Devadas & Ghosh 1995). 

Figure 2 Precomputation architecture. 

4 ATPG TECHNIQUES FOR POWER MANAGED CIRCUITS 

We address the problem of generating test patterns to detect faults in circuits us- 
ing data-dependent power management techniques. These techniques, overviewed 
in Section 2, all use mechanisms to prevent transitions in some logic signals from 
propagating to combinational logic circuits, thus reducing power consumption. As 
a consequence, some input combinations to the combinational logic block may no 
longer be possible, hence dramatically reducing the effectiveness of ATPG programs. 

4.1 Definition of the Problem 

We will assume that we have full controllability of the inputs and observability of 
the outputs of the original circuit, i.e., the circuit before the power management tech- 
niques have been applied. Our objective is to measure how the testability of the cir- 
cuit before (Figure 1) and after (Figure 2) power management compare. 

Suppose the combinational logic block A in Figure 1 can have a total of m stuck- 
at faults. Since we are assuming full controllability and observability of the inputs 
and outputs respectively, standard ATPG techniques can be used for test generation. 

After we apply power management, extra logic has been added to the circuit. In 
Figure 2, this corresponds to functions $g_1$ and $g_2$ and to the NOR gate. If the 
possible number of faults in this extra logic is $k$, the total number of faults in 
the precomputed circuit will be $m+k$. However, to be effective, $g_1$ and $g_2$ 
are necessarily much simpler than A, therefore $m+k$ is not much larger than $m$. 
Thus, it is not the extra number of faults that makes test generation significantly 
more difficult. 

The major test generation problems arise from the fact that the precomputation 
logic will prevent some input combinations to the logic block from happening. For 
this reason, in order to fully test the power managed circuit, we may need two input 
vectors. We describe this approach in Section 4.3. 

Still, even using two input vectors for testing, the precomputation logic introduces 
many redundant faults. Redundant faults can create difficulties for most ATPG 
programs as they try to prove that the fault is indeed redundant. In Section 5, we 
describe an ATPG algorithm that can efficiently handle all the redundant faults 
introduced by the precomputation logic. 

4.2 Testing using Scan Techniques 

Before we describe our testing approach based on a two input vector sequence, it 
is worth mentioning the case for circuits using scan test techniques. With scan 
techniques, the registers in the circuit are connected in series. During the testing 
process, these registers can be directly loaded with the desired values and their 
contents can also be read. We therefore have full observability and controllability 
of the inputs and outputs of the combinational blocks in the circuit. 

For circuits with scan techniques, precomputation does not create any significant 
additional testing problem as all registers can be directly loaded, thus circumventing 
the precomputation logic. The only extra concern is the test of the precomputation 
logic, which as previously stated should be a very small fraction of the total logic. 

However, there is some overhead associated with scan techniques. This overhead 
is generally too expensive for all registers in the circuit to be included in the scan 
chain. In practice, partial scan is used, where only a fraction of the registers are in 
the scan chain. Under partial scan, a sequence of input vectors may have to be 
generated in order to set the output of registers not in the scan chain to some 
desired value. This is the motivation for the two input vector based testing approach 
that we present next. 



4.3 Testing using a Two Input Vector Sequence 

As described before, whenever power management is asserted, some input transi- 
tions are prevented from reaching a portion of some combinational logic block. This 
may impede certain input combinations at this logic block from happening. We pro- 
pose to solve this controllability problem by using a two input vector sequence. As 
observability of the outputs is assumed, the result of the test can be verified after the 
second input vector. 

There are two situations to consider. First, consider a fault that can be detected 
using some input combination that disables the precomputation logic. Then, the first 
input vector is of no relevance since the second input vector will be loaded into the 
input registers and thus we are able to set the inputs of the combinational logic to 
any value that is required to detect the fault. 

Now, consider that some fault can only be detected by input combinations that 
assert the precomputation logic. In this case, we build the first input vector such that 
precomputation is disabled and load the desired values to the registers disabled by 
precomputation. Next, the second input vector needs only have the correct values for 
the remaining registers since the first registers will already have the correct values 
and will not be disturbed since the precomputation logic will be active. 
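
The two situations can be illustrated by simulating the registers of a small precomputed circuit. The Python sketch below assumes a 2-bit comparator in which R1 holds the two most significant bits and R2 the two least significant bits, and in which precomputation is asserted when the most significant bits differ. These choices, and all function names, are assumptions made for illustration.

```python
from itertools import product

BITS = list(product((0, 1), repeat=2))  # all values of a 2-bit register


def precomp_active(pre):
    """g1 OR g2 for the comparator: asserted when the MSBs differ."""
    c_msb, d_msb = pre
    return c_msb != d_msb


def clock(state, vector):
    """One clock edge. R1 always loads; R2 loads only when the
    precomputation logic is not asserted by the applied vector."""
    pre, rest = vector
    _, r2 = state
    if not precomp_active(pre):
        r2 = rest
    return (pre, r2)


def two_vector_sequence(target_r1, target_r2):
    """Input vectors that bring (R1, R2) to the target values."""
    if not precomp_active(target_r1):
        # First situation: the first vector is irrelevant.
        return [((0, 0), (0, 0)), (target_r1, target_r2)]
    # Second situation: load R2 with precomputation disabled,
    # then apply the R1 values that assert precomputation.
    return [((0, 0), target_r2), (target_r1, (0, 0))]


def reaches(target_r1, target_r2, initial):
    state = initial
    for vector in two_vector_sequence(target_r1, target_r2):
        state = clock(state, vector)
    return state == (target_r1, target_r2)


# Every register content is reachable from every initial state.
ALL_REACHABLE = all(
    reaches(r1, r2, init)
    for r1 in BITS
    for r2 in BITS
    for init in product(BITS, BITS)
)

# A single vector cannot set R2 when it asserts precomputation.
SINGLE = clock(((0, 0), (0, 0)), ((1, 0), (1, 1)))
```

In this sketch a single vector with most significant bits (1, 0) leaves R2 at its previous content, whereas the two vector sequence reaches every combination at the inputs of block A.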

Figure 3 Circuit transformation. 



4.4 Generating the Two Input Vector Sequence 

We present a circuit transformation technique for the automatic generation of the two 
input vector sequence used in the testing of power managed circuits. This will allow 
us to use combinational ATPG techniques, thus avoiding the more computationally 
expensive sequential test generators (Abramovici, Breuer & Friedman 1990). 

The proposed transformation is shown in Figure 3, obtained from the precomputed 
circuit of Figure 2. We have duplicated the inputs $x$ of the original circuit: 
$x^0$ corresponds to the first input vector and $x^1$ to the second. 

The subset of inputs used in the precomputation logic (in the case of Figure 2, 
$x_1$ and $x_2$) will always be present at the input of the combinational logic, 
therefore this subset of $x^1$ goes directly to block A after transformation. 

The remaining inputs of the second input vector $x^1$ go to the precomputation logic 
and the output of this logic will decide if these values of $x^1$ (no precomputation 
for $x^1$) or those of the first input vector $x^0$ (circuit being precomputed for 
$x^1$) are the inputs to A. 

Note that the values of $x^0$ corresponding to the inputs used in the precomputation 
logic are not defined by the transformation. Yet, for the transformation to make 
sense, we have to make sure that they disable the precomputation logic so that $x^0$ 
can be loaded to the registers. This condition is also shown in Figure 3. 

We can now run standard ATPG tools on the circuit after transformation to obtain 
values for $x^0$ and $x^1$ that detect any fault in the original combinational logic 
block A and precomputation logic $g_1$ and $g_2$. However, we have doubled the 
number of inputs, dramatically increasing the input search space of the ATPG tool. 
Additionally, the circuit after transformation will have a significant amount of 
redundancy, further complicating the problem for the ATPG tool. In the next section 
we describe an ATPG tool that can efficiently handle this problem. 
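
The effect of the transformation can be sketched as a purely combinational function of the two input vectors. The Python sketch below assumes the same hypothetical 2-bit comparator partition as a running example (two bits feeding the precomputation logic, two remaining bits) and returns the values seen at the inputs of block A; it is an illustration of the idea, not a description of Figure 3 at the gate level.

```python
from itertools import product

BITS = list(product((0, 1), repeat=2))  # all values of a 2-bit input group


def precomp_active(pre):
    """g1 OR g2 for a comparator: asserted when the two MSBs differ."""
    c_msb, d_msb = pre
    return c_msb != d_msb


def transformed_inputs_of_A(x0, x1):
    """Inputs of block A in the transformed, purely combinational circuit.

    Each vector is (pre, rest): `pre` feeds the precomputation logic and
    `rest` are the remaining inputs. Returns None when the first vector
    violates the constraint that it must disable precomputation.
    """
    pre0, rest0 = x0
    pre1, rest1 = x1
    if precomp_active(pre0):
        return None
    # Multiplexer driven by the precomputation logic evaluated on x1.
    rest = rest0 if precomp_active(pre1) else rest1
    return (pre1, rest)


def reachable_inputs_of_A():
    vectors = list(product(BITS, BITS))
    reached = set()
    for x0, x1 in product(vectors, repeat=2):
        result = transformed_inputs_of_A(x0, x1)
        if result is not None:
            reached.add(result)
    return reached


REACHED = reachable_inputs_of_A()
```

In this example all 16 input combinations of block A are reachable, at the price of a search space of 256 vector pairs instead of 16 single vectors.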

5 ALGORITHMS FOR ATPG 

As described in the previous sections, data-dependent power management based on 
precomputation can introduce a large number of redundant faults. In addition, the so- 
lution of using two input vector sequences for detecting faults in the resulting circuit 
potentially duplicates the search space. As a result, circuits using data-dependent 
power management are expected to be significantly harder for test pattern genera- 
tion tools. This is indeed the case and traditional ATPG algorithms are likely to be 
unable to yield acceptable fault coverages for circuits using data-dependent power 
management. In particular, this is the case with the D-algorithm, PODEM, FAN and 
SOCRATES (Abramovici et al. 1990) and with recent implementations of these al- 
gorithms (Lee & Ha 1993). 

Nevertheless, given the relationship between Propositional Satisfiability (SAT) 
and ATPG, recent work on efficient search algorithms for SAT (Silva & Sakallah 
1996) can potentially enable the development of ATPG algorithms specifically suited 
for circuits with many hard-to-detect faults. 

5.1 Satisfiability-Based ATPG 

It is well-known that fault detection problems can be cast as instances of SAT 
(Larrabee 1992, Stephan, Brayton & Sangiovanni-Vincentelli 1996). Basically, the 
valid assignments to the nodes of a circuit can be represented by a Conjunctive Nor- 
mal Form Formula (CNF). For ATPG, we just need to consider two replicas of a 
given circuit, one denoting the good circuit and the other denoting the faulty circuit 
(on which the given fault must be activated). By creating the OR of the XOR’s of 
the primary outputs of the two circuits, and by requiring the output of the OR gate 
to assume value 1, we create a satisfiability problem whose solution is a test pattern 
for the given fault (Larrabee 1992). In general, additional information is added to the 
CNF representation in order to prune the amount of search. 
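
The construction of a good and a faulty replica can be illustrated on a two-input circuit. In the Python sketch below, exhaustive enumeration of the inputs stands in for the SAT solver, and no CNF formula is built; the circuit y = (a AND b) OR a, the net names and the fault encoding are hypothetical. The circuit was chosen because it contains a redundant fault.

```python
from itertools import product


def circuit(a, b, fault=None):
    """y = (a AND b) OR a. `fault` is None or (net_name, stuck_value)."""

    def inject(net, value):
        # Replace the value of the faulty net by its stuck-at value.
        return fault[1] if fault and fault[0] == net else value

    a = inject("a", a)
    b = inject("b", b)
    n1 = inject("n1", a & b)
    return inject("y", n1 | a)


def miter(inputs, fault):
    """XOR of the outputs of the good and the faulty replica. With a single
    primary output, the OR of the XORs reduces to this one XOR."""
    return circuit(*inputs) ^ circuit(*inputs, fault=fault)


def find_test(fault):
    """Return an input vector that sets the miter output to 1, or None if
    the fault is redundant. Enumeration replaces the SAT search here."""
    for inputs in product((0, 1), repeat=2):
        if miter(inputs, fault):
            return inputs
    return None
```

For example, `find_test(("n1", 1))` returns `(0, 0)`, whereas `find_test(("n1", 0))` returns `None`: since y reduces to a, the stuck-at-0 fault on n1 is redundant and the whole input space has to be explored to prove it.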

It is generally accepted that SAT-based ATPG algorithms have a few significant 
drawbacks. First, representing each fault detection problem as an instance of SAT 
is extremely time-consuming. Indeed, known experimental results indicate that CNF 
formula creation can take as much as 75% of the total testing time (Stephan et al. 
1996). Second, since all clauses in a CNF formula must be satisfied, test patterns 
may become over specified. Consequently, SAT-based ATPG algorithms may yield 
test sets larger than necessary. 

Despite these drawbacks, SAT-based ATPG algorithms are particularly versatile in 
that improvements to SAT algorithms can be readily applied and extended for ATPG. 

5.2 The GRASP SAT Algorithm 

The GRASP SAT algorithm is detailed in (Silva & Sakallah 1996), and it is basically 
a backtrack search algorithm. However, GRASP is able to analyze the causes of 
conflicts, i.e. situations of the search in which one or more clauses have all literals 
set to 0. Analysis of the causes of conflicts can be used for implementing several 
powerful pruning techniques: 

• By analyzing the causes of conflicts we can backtrack directly to the cause of each 
conflict. Hence, GRASP implements non-chronological backtracking. 

• Given that the causes of conflicts are identified, they can be recorded as new 
clauses, and so these new clauses can be used to augment the original CNF for- 
mula. Hence, we can prevent known conflicts from being identified again during 
the search. 

• Careful analysis of the structure of conflicts permits identifying variable assign- 
ments which are deemed necessary for a solution to be found. Since GRASP 
identifies more necessary assignments due to the causes of conflicts, the search is 
further pruned. 
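
The notions of conflict and of clause recording can be illustrated with a very small backtrack search in Python. This sketch is not GRASP: it does not build an implication graph and it backtracks chronologically. It only shows how a conflict (a clause with all literals set to 0) is detected during unit propagation, and how a crude cause of the conflict, here simply the set of current decisions, can be recorded as a new clause. Clauses are lists of non-zero integers, a negative integer denoting a complemented variable; all names are hypothetical.

```python
def unit_propagate(clauses, assignment):
    """Extend `assignment` with implied values.
    Return False if some clause has all its literals set to 0."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var, val = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == val:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return False  # conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return True


def solve(clauses, variables, decisions=(), learned=None):
    """Return a satisfying assignment (dict) or None.
    `learned` collects the clauses recorded after conflicts."""
    if learned is None:
        learned = []
    assignment = {abs(lit): lit > 0 for lit in decisions}
    if not unit_propagate(clauses + learned, assignment):
        if decisions:
            # The decisions cannot all hold: record their negation.
            learned.append([-lit for lit in decisions])
        return None
    free = [v for v in variables if v not in assignment]
    if not free:
        return assignment
    for lit in (free[0], -free[0]):
        result = solve(clauses, variables, decisions + (lit,), learned)
        if result is not None:
            return result
    return None
```

Each recorded clause is implied by the original formula, so adding it never removes a solution; GRASP derives much smaller clauses from the actual causes of each conflict, which is what makes the pruning effective.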

As illustrated in (Silva & Sakallah 1996), new pruning techniques can be eas- 
ily incorporated into GRASP. In addition, preliminary experimental results strongly 
suggest that GRASP is one of the most efficient SAT algorithms for highly structured 
instances of SAT. 

Moreover, instances of SAT obtained from fault detection problems for stuck-at 
or bridging faults are highly structured, and GRASP performs particularly well on 
these benchmarks. As a result, circuits having a significant number of hard-to-detect 
faults are potentially amenable to an ATPG tool based on GRASP. The experimental 
results given in Section 6 strongly support this motivation. 

The proposed ATPG algorithm, named TG-GRASP, basically encodes fault detec- 
tion problems using the approach given in (Stephan et al. 1996) and uses GRASP as 
the back-end search engine. In addition, a few additional features have been incor- 
porated into TG-GRASP: 

• To prevent the over-specification of test patterns, TG-GRASP implements syn- 
tactic satisfiability, which permits early identification of sufficient conditions for 
satisfiability before satisfying all clauses in the CNF formula. This technique can 
be viewed as equivalent to restricted forms of dynamic head line identification 
that can be identified by circuit-based ATPG tools (Silva & Sakallah 1994). 

• Because GRASP records new clauses during the search as by-product of conflict 
analysis, some of these clauses, in particular the ones solely associated with vari- 
ables in the good circuit, can be re-used for subsequent faults, thus potentially 
pruning the amount of search for subsequent fault detection problems. We refer 
to these clauses as pervasive clauses. 

Syntactic satisfiability reduces computation time for detectable faults, whereas 
pervasive clauses, because they add additional constraints to the search, potentially 
reduce the computation time for all faults. 



6 EXPERIMENTAL RESULTS 

In this section we compare the testability of a subset of the circuits in the MCNC'91 
combinational benchmark set before and after adding data-dependent power management. 
In Table 1 we present the statistics for the circuits used. Under the column 
Original we give the number of primary inputs (PI), the number of primary outputs 
(PO), the number of gates (Gates) and literals (Lits) in the circuit and its power 
dissipation (Power). 

Table 1 Statistics of the circuits and power reduction through precomputation. 

Circuit                 Original                      Precomp.            %
Name         PI    PO   Gates    Lits    Power      Lits    Power      Red.
9symml        9     1     157     327   1837.1        40   1487.4      19.0
apex2        39     3     195     390   2322.7         4   1201.2      48.3
comp         32     3     105     188   1712.6        13    720.9      57.9
comp16       35     3     221     413   2597.8        17    907.9      65.1
cps          24   102    1001    1979   4518.8        26   2879.5      36.3
dalu         75    16     827    1611   7014.2        20   3638.4      48.1
duke2        22    29     351     716   2243.7        20   1527.6      31.9
e64          65    65     251     379   2554.7         5    573.7      77.5
i2          201     1     156     369   7748.5        22   2520.0      67.5
misex3       14    14     533     nil   3473.1         2   2503.9      27.9
seq          41    35    1246    2590   7494.2         0   3923.1      47.7
too_large    38     3     234     471   2396.4         1   1561.1      34.9

Also in Table 1, we show the power savings obtained after precomputation is applied 
to each circuit. Under Precomp., we give the number of literals in the precomputation 
logic and the power of the power-managed circuit. We can see that the size 
of the precomputation logic is in general much smaller than the original circuit. The 
last column of Table 1 indicates the percentage savings in power obtained through 
precomputation. As we can observe, power reductions of up to 77% are possible. 
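
The last column of Table 1 can be checked with a few lines of Python. The sketch below assumes that the percentage reduction is the power saved relative to the original circuit; the function name is ours, and only three rows of the table are copied in.

```python
def percent_reduction(original_power, precomputed_power):
    """Power saving of the precomputed circuit, as a percentage of the original."""
    return 100.0 * (original_power - precomputed_power) / original_power


# (original power, power after precomputation), copied from Table 1
table1_power = {
    "9symml": (1837.1, 1487.4),
    "apex2": (2322.7, 1201.2),
    "e64": (2554.7, 573.7),
}

for name, (before, after) in table1_power.items():
    print(f"{name}: {percent_reduction(before, after):.1f}%")
# 9symml: 19.0%
# apex2: 48.3%
# e64: 77.5%
```

Under this assumption the computed values agree with the % Red. column for these rows, including the 77.5% maximum reported for e64.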

To measure the testability of the circuits, we have used two SAT-based ATPG 
tools, TEGUS (Stephan et al. 1996) and TG-GRASP, described in Section 5. The 
CPU times reported are for a Sun 5/85 machine with 64 MByte of physical memory. 
All tools were compiled with the same optimization options. 

In Table 2 the ATPG results for the original MCNC’91 benchmark circuits are 
shown. The transformation described in Section 4.4 was applied to the precomputed 
circuits and the ATPG programs were run on the modified circuit. The ATPG results 
after precomputation are given in Table 3. In each table F, D, R, A and CPU denote, 
respectively, the total number of faults, the number of detected faults, the number of 
redundant faults, the number of aborted faults and the CPU time for each tool. For 
each benchmark all faults are targeted. This solution permits a larger set of faults to 
be studied and guarantees that each ATPG tool is presented with exactly the same set 
of faults. 

Table 2 ATPG results for the MCNC original benchmark circuits. 

Circuit                      TEGUS                    TG-GRASP
Name           F       D     R   A      CPU       D     R   A      CPU
9symml       752     750     2   0     15.9     750     2   0     18.3
apex2        948     945     3   0     77.1     945     3   0     25.3
comp         480     479     1   0      6.2     479     1   0      6.6
comp16       960     960     0   0     15.7     960     0   0     17.2
cps         4642    4640     2   0    146.6    4640     2   0    199.5
dalu        3742    3740     1   1    313.8    3740     2   0    201.0
duke2       1708    1708     0   0     41.7    1708     0   0     53.7
e64         1142    1136     0   0     27.3    1142     0   0     34.6
i2           762     760     2   0     17.1     760     2   0     18.7
misex3      2590    2583     7   0    105.8    2583     7   0    130.2
seq         5912    5908     4   0   1375.4    5908     4   0    378.6
too_large   1132    1117    15   0     27.5    1117    15   0     35.0



For both the original and precomputed circuits, TEGUS and TG-GRASP have 
comparable and acceptable performance in all circuits. Nevertheless, for a few 
circuits the pruning techniques used in TG-GRASP make the difference and lead to 
reasonably smaller CPU times. Note that for benchmark dalu, TEGUS aborts one fault. 
These results, and the fact that TG-GRASP aborts no faults, illustrate the robustness 
of the TG-GRASP algorithmic solution. Even though precomputation introduces a 
large number of redundant faults, there exist ATPG tools which can test the resulting 
circuits with 100% fault coverage. 
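
As a small worked check of the last claim, the sketch below treats a fault as resolved when it is either detected (D) or proven redundant (R), and reports the resolved share of all faults (F). This reading of "100% fault coverage" and the function name are our assumptions; the numbers are the dalu rows of Table 3.

```python
def resolved_percentage(total, detected, redundant):
    """Share of faults that are either detected or proven redundant."""
    return 100.0 * (detected + redundant) / total


# (F, D, R) for dalu after precomputation, copied from Table 3
dalu_precomputed = {
    "TEGUS": (4794, 4274, 519),     # one fault aborted
    "TG-GRASP": (4794, 4274, 520),  # no fault aborted
}

for tool, (f, d, r) in dalu_precomputed.items():
    print(f"{tool}: {resolved_percentage(f, d, r):.2f}%")
# TEGUS: 99.98%
# TG-GRASP: 100.00%
```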

Table 3 ATPG results for the MCNC benchmark circuits with precomputation. 

Circuit                      TEGUS                    TG-GRASP
Name           F       D     R   A      CPU       D     R   A      CPU
9symml       932     890    42   0    160.2     890    42   0     65.6
apex2       1492    1243   249   0    142.1    1243   249   0     79.5
comp         956     749   207   0     28.1     749   207   0     26.0
comp16      1448    1228   220   0    173.8    1228   220   0     53.4
cps         4996    4844   152   0    457.9    4844   152   0    351.2
dalu        4794    4274   519   1    513.1    4274   520   0    468.1
duke2       1970    1847   123   0     85.1    1847   123   0    104.9
e64         2122    1686   430   0    128.0    1692   430   0    139.0
i2          3696    2371  1325   0    425.1    2371  1325   0    357.7
misex3      2792    2696    96   0    145.9    2696    96   0    170.1
seq         6536    6259   277   0    838.7    6259   277   0    593.4
too_large   1706    1430   276   0     74.9    1430   276   0     92.5


Abstract 



Given a set of data words or messages to be transmitted over a bus such that the sequence 
(order) in which they are transmitted is irrelevant, we address the problem of determining 
the optimum sequence that minimizes the total number of transitions on the bus. Since 
busses take up a significant fraction of chip area, the bus capacitances are often 
considerable; the bus power may then account for as much as 40% of the total power 
consumed on the chip (Designer 1995). Thus exploiting the freedom to resequence the 
data words can lead to substantial power savings. 

This problem arises during the flushing of a cache and transmission of packets over a 
channel. We also show how some power minimization problems in scheduling during high- 
level synthesis, in instruction-sequencing for embedded applications, and in die testing can 
be cast as the data ordering problem. We prove that the data ordering problem is NP- 
complete. Nevertheless, we propose two polynomial-time algorithms to approximate the 
optimum solution to within a constant factor. The first algorithm gives a solution to within 
a factor of 2 from the optimum, and the second within a factor of 1.5, but at an additional 
cost that is a function of the word-size. Experimental results confirm that resequencing data 
using the proposed algorithms leads to significant reduction (by 36%) in switching activity 
and hence power savings. 

Keywords: Low Power Design, System-Level Design. 



1 MOTIVATION 

With the proliferation of portable electronic devices, low power has become an important 
design objective. Among the suite of emerging techniques for low power designs are pipelining 
and parallelization (Chandrakasan, Sheng & Brodersen 1992, Duncan, Swamy & Jain 1993), 
reduction of supply voltage (Chandrakasan, Sheng & Brodersen 1992), and minimization of 
switching activity. The problem of minimizing the switching activity in a circuit has been 
addressed at various levels: system level (Chandrakasan, Sheng & Brodersen 1992, Bunda, 
Athas & Fussell 1994, Chandrakasan, Potkonjak, Rabaey & Brodersen 1992), logic or gate 
level (Roy & Prasad 1992, Shen, Devadas, White, Ghosh & Keutzer 1993, Tsui, Pedram 
& Despain 1993, Tiwari, Ashar & Malik 1993), layout level (Chao & Wong 1994, Hirendu 
& Pedram 1993, Cong, Koh & Leung 1994), transistor level (Tan & Allen 1994), etc. In 
this paper, we examine the problem of reducing switching activity on high-capacitance bus 
lines. In several applications such as microprocessors and DSPs, busses account for almost 
40% of the chip area; 30-40% of the power consumption on a chip is due to bus-power 
(Designer 1995), i.e., the power consumed in charging/discharging the often large bus 
capacitances. 

More precisely, given a set of data words or messages to be transmitted over a bus such 
that the sequence in which they are transmitted is irrelevant, we address the problem of 
determining the optimum sequence that minimizes the total number of transitions on the 
bus. Consider the following situations where resequencing (or reordering) can lead to power 
reduction. 

Cache write-back: Consider a computer system with a main memory and a cache. 
On a context switch, often the cache (main memory) is "flushed," i.e. the data residing in 
the cache (main memory) has to be written back to main memory (secondary memory). 
Or, in a distributed memory multi-processor system the entire contents of a local cache 
may be written back over a global bus to memory and the local caches of other processors. 
In all such applications the order in which the lines are written back is often irrelevant. 
For minimizing the power dissipated during the transmission, the problem is: in which 
order should the words be transferred such that the bus lines have the fewest transitions 
(0 to 1 or 1 to 0)? Not only should one consider transitions on the data bus, but also on 
the address bus. Since average power consumed is directly proportional to the expected 
switching activity, fewer transitions result in lower power consumption. 

Example 1 Consider 3 cache locations, with the following addresses and data. 

address         data
$a_0$ = 0111    $w_0$ = 0001
$a_1$ = 1000    $w_1$ = 1110
$a_2$ = 1001    $w_2$ = 0011

If the order in which the data is transferred from the cache is $w_0, w_1, w_2$, we can compute 
the total number of transitions on the address and the data lines as follows. For $w_0$ to $w_1$, 
there are 4 transitions on the address bus, and 4 transitions on the data bus. Thus, in all 
8 transitions. Similarly, from $w_1$ to $w_2$, there is 1 transition on the address bus and 3 on 
the data bus, making it a total of 4. For the entire transmission, the number of transitions 
is then 12. However, if the data is transferred in the order $w_0, w_2, w_1$, the number of 
transitions reduces to 4 + 4 = 8, two-thirds of the previous count. 
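
The counts in Example 1 can be reproduced with the short Python sketch below. The dictionary layout and function names are ours; the bit patterns are those of the example.

```python
def hamming(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# (address, data) for each cache location of Example 1
cache = {
    "w0": ("0111", "0001"),
    "w1": ("1000", "1110"),
    "w2": ("1001", "0011"),
}


def total_transitions(order):
    """Transitions on the address and data busses for a given write-back order."""
    total = 0
    for prev, nxt in zip(order, order[1:]):
        (a_prev, d_prev), (a_next, d_next) = cache[prev], cache[nxt]
        total += hamming(a_prev, a_next) + hamming(d_prev, d_next)
    return total


print(total_transitions(["w0", "w1", "w2"]))  # 12
print(total_transitions(["w0", "w2", "w1"]))  # 8
```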



The problem then is to determine the order in which the data words should be transferred 
such that the total number of transitions on the data and the address busses are minimized. 

Scheduling in high-level synthesis: Consider the scheduling problem in high-level 
synthesis, where high-level operations have to be scheduled in different clock cycles subject 
to resource constraints. For instance, consider an FIR (finite-impulse response) filter that 
computes $x_0 c_0 + x_1 c_1 + x_2 c_2 + x_3 c_3$, where the $x_i$'s are signal inputs, and the 
$c_i$'s are fixed constants. Assume we are constrained to use only one multiplier for all the 
four multiplications. We will need to use multiplexors/busses at the two inputs of the 
multiplier that will select appropriate $x_i$'s and $c_i$'s corresponding to the multiplication 
operations in different clock cycles. Since in the FIR computation, multiplications can be 
executed in any order, there is an opportunity to schedule them on the single available 
multiplier so that the total number of transitions at the multiplier inputs is minimized over 
the four multiplications.* Consider the multiplier input fed by the constants $c_i$. If the 
multiplications are scheduled as $x_0 c_0$ followed by $x_1 c_1$, followed by $x_2 c_2$, and then 
$x_3 c_3$, the transitions seen at this multiplier input over one sample period will be the sum 
of the number of bits switching from $c_0$ to $c_1$, $c_1$ to $c_2$, $c_2$ to $c_3$, and $c_3$ to $c_0$. 
The $c_3$ to $c_0$ transition is considered because the next sample period starts with the 
multiplication $x_0 c_0$. For a different schedule, the transitions will be different. Thus we 
discover the problem of word sequencing (the words are the constants $c_i$) in this context. 
Note that for the second input of the multiplier, which is fed by the input signals $x_i$, it is 
not possible to determine the transitions a priori, since the inputs can assume any values. 
It may be possible to run several simulation inputs and compute the average number of 
transitions between $x_i$ and $x_j$. Then, the scheduling can be carried out so as to minimize 
the number of transitions at both inputs. 

*It has been observed that the transitions within the multiplier are positively correlated to 
the transitions at the multiplier inputs. 

Figure 1 Transmitting a set of words 
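
The wrap-around from the last constant back to the first makes this a cyclic variant of the ordering problem. A minimal sketch, assuming four hypothetical 4-bit coefficient values (not taken from the paper) and exhaustive search over schedules:

```python
from itertools import permutations


def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def cyclic_transitions(schedule):
    """Bit transitions at the constant input over one sample period,
    including the switch from the last constant back to the first."""
    n = len(schedule)
    return sum(hamming(schedule[i], schedule[(i + 1) % n]) for i in range(n))


# Hypothetical coefficient values c0..c3, for illustration only
coeffs = [0b0000, 0b1111, 0b0001, 0b1110]

print(cyclic_transitions(coeffs))  # 14 for the schedule c0, c1, c2, c3
best = min(permutations(coeffs), key=cyclic_transitions)
print(cyclic_transitions(best))    # 8 for the best cyclic schedule
```

Exhaustive search is only practical for a handful of operations; it is used here purely to show how much the schedule can matter.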

Instruction sequencing: In compiled code, there often arise sets of instructions that 
can be executed in any order without altering the behavior of the program. Therefore, for 
low power applications, it is possible to resequence the instructions, thereby reducing power 
consumption in fetching the instructions from memory (Tiwari, Malik & Wolfe 1994). 

Die testing: The first steps of IC manufacturing involve etching the dies on a wafer 
and testing them. The defective dies are removed and the good ones are packaged. During 
testing, input vectors are applied to the bare die, which has very little capacity for power 
dissipation. Thus the die is stressed much more during this step than at any other step in its 
life-time (Chakravarty & Dabholkar 1994), and may fail during the test session, thus reducing 
the die-yield. This reduction is significant, especially in larger and higher density circuits 
such as multi-chip modules. This leads to the problem of power minimization during die 
testing. One way to lower power consumption is to reorder the test vectors (for combinational 
circuits) and test sequences (for sequential circuits) so that the switching activity in 
the die is minimized. Since switching activities at the circuit inputs and within the circuit 
are strongly correlated, it is worthwhile to minimize the activity at the circuit inputs, say 
by reordering the tests. A modified version of the test ordering problem was addressed in 
(Chakravarty & Dabholkar 1994), where the power consumed in the circuit under test is 
also explicitly modeled. 

Transmitting a set of records: Consider a transmitter-receiver system, transmitting 
and receiving a set of data records over a bus (see Figure 1). There are applications where 
it does not matter in what order the records are sent. For instance, if each record carries 
its own identification, the ordering of the records can be arbitrary (the receiver figures out 
which record is which by examining the identification label of the record). However, to 
minimize power dissipation during transmission, it may be that one ordering is better than 
another. It is useful that the records be sent in an order that minimizes the total number 
of transitions over the entire set of records.* 

Underlying all the above scenarios is the problem of determining the optimum order 
in which to transmit data words such that the total number of transitions is minimized. 
Optimization for low power by exploiting the freedom to order the items to be transmitted 
is a hitherto unstudied and unutilized optimization possibility; the savings resulting can 
be substantial — especially for transmission over busses, since bus capacitances, especially 
off-chip, can be very large. 

We distinguish two kinds of situations. In certain applications such as instruction 
sequencing and transmitting a set of records, the data to be transmitted is known and fixed a 
priori, unlike in some other applications, such as cache write-back, where the data changes 
at different time instances. The algorithms we propose can be used in both cases, but they 
would be more effective in the first case, where the algorithms can be simply applied to 
reorder the data statically, without incurring any area and power cost (this is akin to 
optimization at compile-time). However, when the data is not known a priori, the algorithms 
have to be implemented in silicon and then invoked dynamically, akin to optimization at 
run-time. The extra area and power consumption of this silicon can be justified only if 
the bus capacitance and the frequency of transmission are high enough, as in the case of 
data-intensive, distributed-memory applications. 

*However, in many applications, each record is a set of data words and not a single word. 
This issue is addressed in Section 7. 

Figure 2 An example graph and its minimum-weight spanning tree 

The rest of the paper is organized as follows. In Section 2 we formally state the data 
ordering problem for minimum transitions. Graph-theoretic preliminaries are covered in 
Section 3. Section 4 addresses the complexity of the data ordering problem for minimum 
total transitions. We prove that the corresponding decision problem is NP-complete. In 
Section 5, we present three heuristics to solve the problem. Experimental results and im- 
plementation details are covered in Section 6. Section 7 extends the basic data ordering 
formulation to some other applications. Proofs not provided in this paper can be found in 
(Murgai, Fujita & Krishnan 1995). 



2 PROBLEM DEFINITION 

We are given a set of data words $w_1, w_2, \ldots, w_n$, each $k$-bit long, that are to be 
transmitted over a bus. The goal is to minimize the total number of transitions over the entire 
transmission interval. 

If word $w_r$ is transmitted, immediately followed by $w_s$, the total number of transitions 
is given by the number of bits that change. This is 
$d(w_r, w_s) = \sum_{j=1}^{k} w_{rj} \oplus w_{sj}$, sometimes called the Hamming distance 
between $w_r$ and $w_s$. Here, $w_{rj}$ denotes the $j$th bit of $w_r$, and $\oplus$ denotes the 
EX-OR operation. For instance, the Hamming distance between 11001 and 10010 is 3, i.e., 
$d(11001, 10010) = 3$. 

This data ordering problem (DOP) can be restated as: "find a permutation $\sigma$ of the 
words $\{w_1, w_2, \ldots, w_n\}$ such that the total number of transitions 
$\sum_{i=1}^{n-1} d(w_{\sigma(i)}, w_{\sigma(i+1)})$ is minimized." 
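
The definition translates directly into code. The sketch below, with function names of our choosing and a hypothetical four-word instance, evaluates the objective and finds an optimum by trying all $n!$ permutations; it is meant only to pin down the problem statement, not as a practical method.

```python
from itertools import permutations


def hamming(u, v):
    """d(u, v): number of positions in which two k-bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


def transitions(sequence):
    """Total transitions when the words are sent in the given order."""
    return sum(hamming(a, b) for a, b in zip(sequence, sequence[1:]))


def brute_force_dop(words):
    """Optimal ordering by exhaustive search (feasible only for small n)."""
    return min(permutations(words), key=transitions)


assert hamming("11001", "10010") == 3  # the example from the text

# Hypothetical instance with n = 4 words of k = 5 bits
words = ["11001", "10010", "11011", "00010"]
best = brute_force_dop(words)
print(transitions(words))        # 8 in the given order
print(best, transitions(best))   # ('11001', '11011', '10010', '00010') 4
```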



3 PRELIMINARIES 

We give a few definitions and basic results from graph theory that will be used in the paper. 
The reader who is familiar with them may skip this section. 

A spanning tree of a graph $G = (V, E)$ is a subgraph of $G$ that is a tree and spans all 
the vertices of $G$. If the edges have weights, it makes sense to talk of the minimum-weight 
spanning tree of $G$, where the weight of a tree is the sum of the weights of the edges in 
the tree. It turns out that a simple, greedy algorithm of starting from an empty tree and 
repeatedly selecting the minimum-weight edge and adding it to the partially generated tree 
if the tree still remains acyclic generates a minimum-weight spanning tree (Nemhauser & 
Wolsey 1988). For the weighted graph of Figure 2 (A), the minimum-weight spanning tree 
generated by this algorithm is shown in Figure 2 (B). 

Figure 3 Determining an Eulerean cycle 
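
The greedy procedure just described is commonly known as Kruskal's algorithm. A minimal Python sketch follows; the union-find structure used for the acyclicity test and the small example graph are our additions, since the graph of Figure 2 is not reproduced here.

```python
def minimum_spanning_tree(num_vertices, edges):
    """Greedy construction of a minimum-weight spanning tree.

    edges: list of (weight, u, v) with vertices numbered 0..num_vertices-1.
    Returns the list of edges selected for the tree.
    """
    parent = list(range(num_vertices))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):     # lightest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # adding (u, v) creates no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical weighted graph on 4 vertices
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
tree = minimum_spanning_tree(4, edges)
print(tree)                          # [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
print(sum(w for w, _, _ in tree))    # 6
```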

A Hamiltonian tour of a graph $G = (V, E)$ is a walk (simple path) with the same 
beginning and end points that visits each vertex of V exactly once. A Hamiltonian path of 
G is a walk with different beginning and end points that visits each vertex of V exactly 
once. For instance, 1, 2, 3, 4, 6, 5, 7, 1 is a Hamiltonian tour of the graph of Figure 2 (A), 
whereas 1, 2, 3, 4, 6, 5, 7 is a Hamiltonian path. If the edges have weights, we have the 
problems of finding a minimum-weight Hamiltonian tour (better known as the traveling 
salesman problem tour or TSP-tour) and minimum-weight Hamiltonian path (TSP-path). 
Both of these are known to be NP-complete (Garey & Johnson 1979), even for complete 
graphs. 

A graph $G = (V, E)$ in which there may be more than one edge joining a pair of nodes is 
called a multigraph. An Eulerean cycle of a multigraph is a walk with the same beginning 
and end points that contains each edge of the graph exactly once. For example, in Figure 
3, the tour 1, 2, 5, 3, 4, 6, 5, 7, 1 is an Eulerean cycle. The following classical result gives 
necessary and sufficient conditions for the existence of an Eulerean cycle. A multigraph 
contains an Eulerean cycle if and only if each node is of even degree. The following simple 
procedure finds an Eulerean cycle in such a multigraph. 

Procedure 1 (Finding an Eulerean cycle) Start at any vertex $v$ of $G$. Traverse any edge 
incident on $v$ whose removal does not disconnect the remaining (untraversed) graph.* Let 
this edge be $(v, w)$. Delete $(v, w)$. Repeat the procedure from $w$. Terminate when no more 
edges remain in the graph. This can happen only when we are back at $v$. The order in which 
the edges are traversed yields the Eulerean cycle. 



Consider the graph G of Figure 3 (A). Since all the vertices are of even degrees, G has an 
Eulerean cycle. Let us arbitrarily start from vertex 1. Since removal of edge (1,2) leaves 
G connected, we come to vertex 2, and delete the edge (1, 2). Similarly, from 2, we go to 
vertex 5 and delete (2,5). Now, we have a choice of whether to traverse the edge (5, 7) 
or (5, 3). If we traverse (5, 7), the remaining graph is disconnected, as shown in Figure 3 
(B). However, if we traverse edge (5, 3), the remaining graph is still connected, as shown in 
Figure 3 (C). So, we traverse (5, 3). Thereafter, there are no more choices to exercise. The 
Eulerean cycle is then 1, 2, 5, 3, 4, 6, 5, 7, 1. 
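
Procedure 1 can be sketched in Python as follows. The edge list is inferred from the cycle quoted above (Figure 3 itself is not reproduced), and the data layout and helper names are ours. An edge is taken if it is the only one left at the current vertex, or if its far end can still reach the current vertex without it.

```python
from collections import defaultdict


def eulerian_cycle(edges, start):
    """Procedure 1 on a connected multigraph whose vertices all have even degree."""
    unused = set(range(len(edges)))      # ids of edges not yet traversed
    incident = defaultdict(list)         # vertex -> ids of incident edges
    for eid, (u, v) in enumerate(edges):
        incident[u].append(eid)
        incident[v].append(eid)

    def other_end(eid, x):
        u, v = edges[eid]
        return v if x == u else u

    def reachable(src, skipped):
        """Vertices reachable from src over unused edges other than `skipped`."""
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            for eid in incident[x]:
                if eid in unused and eid != skipped:
                    y = other_end(eid, x)
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
        return seen

    cycle, current = [start], start
    while True:
        candidates = [e for e in incident[current] if e in unused]
        if not candidates:
            break
        if len(candidates) == 1:
            eid = candidates[0]          # no choice left at this vertex
        else:
            # Take the first edge whose removal keeps the remaining graph connected.
            eid = next(e for e in candidates
                       if current in reachable(other_end(e, current), e))
        unused.remove(eid)
        current = other_end(eid, current)
        cycle.append(current)
    return cycle


# Edges of the example graph; (5, 7) is listed before (5, 3) on purpose,
# so that the procedure has to reject it at vertex 5.
graph = [(1, 2), (2, 5), (5, 7), (5, 3), (3, 4), (4, 6), (6, 5), (7, 1)]
print(eulerian_cycle(graph, 1))  # [1, 2, 5, 3, 4, 6, 5, 7, 1]
```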

*If initially each node in $G$ is of even degree, the algorithm always finds such an edge. 

Given a complete graph $G = (V, E)$ and a spanning tree $T = (V, E')$ of $G$, the following 
procedure constructs a Hamiltonian tour on $G$. 

Procedure 2 (Constructing a Hamiltonian tour) Construct the multigraph $G'$ from $T$ by 
duplicating each edge $e \in E'$. Since each node of $G'$ is of even degree, $G'$ contains an 
Eulerean cycle $U$. Construct $U$ using Procedure 1. Delete all the node repetitions from $U$ 
except for the final return to the first node. The resulting node sequence $HT$ is a 
Hamiltonian tour on $G$. 



Note that any permutation of vertices of a complete graph yields a Hamiltonian tour. 
However, we will be interested in a minimum-weight Hamiltonian tour, for which we will 
start from a minimum-weight spanning tree $T$ and use Procedure 2. 

Let $d(S)$ denote the total weight of the tour (path, tree) $S$, where the weight of a tour 
(path, tree) $S$ is the sum of the weights of the edges in $S$. 

If there is a non-negative weight $d(v_i, v_j)$ on the edge $(v_i, v_j)$ of $G$, and if in addition, 
the weights satisfy the triangle inequality, then it is easy to see that $d(HT) \le d(U)$. This is 
because while obtaining $HT$ from $U$, all the node repetitions are being deleted. For instance, 
if $v_1 v_3 v_4 v_1 v_6$ is a subsequence of $U$, deleting the second occurrence of $v_1$ is equivalent 
to replacing the edges $(v_4, v_1)$ and $(v_1, v_6)$ by $(v_4, v_6)$. Because the triangle inequality is 
satisfied, $d(v_4, v_1) + d(v_1, v_6) \ge d(v_4, v_6)$. Hence, the weight of the tour $HT$ is no more than 
the weight of the tour $U$. Also, $d(U) = 2d(T)$. So we get 

$$d(HT) \le 2d(T) \qquad (1)$$

We will use this fact later. 

A matching in a graph $G = (V, E)$ is a subset $E' \subseteq E$ such that no two edges of $E'$ are 
incident to each other. A perfect matching is a matching that is incident to each vertex in 
$V$. Clearly, $|V|$ must be even for a perfect matching to exist. In many cases, the edges of 
the graph have weights on them. One optimization problem is to find a minimum-weight 
(perfect) matching in G. There exists an algorithm for this problem that runs in time O(n^) 
(Papadimitriou & Steiglitz 1982). 

Consider the graph of Figure 2 (A). It does not have a perfect matching, since it has an odd 
number of vertices. However, {(1, 2), (3, 5), (4, 6)} is a matching, whereas {(1, 2), (3, 5), (3, 4), 
(6, 7)} is not, since edges (3, 5) and (3, 4) share the vertex 3. 
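
The two edge sets of this example can be checked mechanically. The sketch below only tests whether edges share a vertex; it does not verify that the edges belong to the graph of Figure 2 (A), which is not reproduced here, and the function names are ours.

```python
def is_matching(edge_set):
    """True if no two edges in edge_set share a vertex."""
    seen = set()
    for u, v in edge_set:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True


def is_perfect_matching(edge_set, vertices):
    """True if edge_set is a matching that touches every vertex."""
    covered = {x for edge in edge_set for x in edge}
    return is_matching(edge_set) and covered == set(vertices)


print(is_matching([(1, 2), (3, 5), (4, 6)]))           # True
print(is_matching([(1, 2), (3, 5), (3, 4), (6, 7)]))   # False: vertex 3 is shared
# With 7 vertices, vertex 7 is left uncovered, so the matching is not perfect.
print(is_perfect_matching([(1, 2), (3, 5), (4, 6)], range(1, 8)))  # False
```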



4 DATA ORDERING PROBLEM IS NP-COMPLETE 

In this section, we show that the data ordering problem for minimum number of transitions 
is hard. 

As a warm-up, we first show that DOP can be transformed to the minimum-weight 
TSP-path (or Hamiltonian path) problem in a complete graph $G$ (which is known to be 
NP-complete (Garey & Johnson 1979)). Construct a complete graph $G = (V, E)$, whose 
vertices $v_i$ are in one-to-one correspondence with the data words $w_i$. Edge $(v_i, v_j)$ has weight 
$d(w_i, w_j)$, the Hamming distance between the words $w_i$ and $w_j$. Then, the minimum-weight 
TSP-path in $G$ yields a permutation of the vertices and thus of the data words, for fewest 
transitions. 

This, however, does not imply that DOP is difficult. The edge weights in the general 
TSP-path problem can be arbitrary, whereas they are not arbitrary in the case of DOP 
since they are derived from the codes of the vertices, i.e., the data words (see Section 5). 

The Data Ordering Problem (DOP) stated as a decision problem is: 

INSTANCE: A set of $n$ $k$-bit data words, $w_1, w_2, \ldots, w_n$, where each $w_i \in \{0, 1\}^k$, 
where $k$ is a positive integer, and a positive integer $L$. 

QUESTION: Is there a permutation $\sigma$ of $w_1, w_2, \ldots, w_n$, i.e. a 1-1 function 
$\sigma : \{1, \ldots, n\} \to \{1, \ldots, n\}$ such that 
$\sum_{i=1}^{n-1} d(w_{\sigma(i)}, w_{\sigma(i+1)}) \le L$? 

Proposition 1 DOP is NP-complete. 



Proof. DOP is easily seen to be in NP. 

We transform a special case of the MANHATTAN PATH-TSP problem (Papadimitriou 
1977) to DOP. MANHATTAN PATH-TSP can be stated as follows: 

INSTANCE: A set $P \subseteq Z^+ \times Z^+$ of points in the plane, a positive integer $B$.* 
QUESTION: Is there a TSP-path of length $B$ or less in the associated complete graph on 
$|P|$ vertices, where the distance between the vertices corresponding to the points $(x_1, y_1)$ 
and $(x_2, y_2)$ is $|x_1 - x_2| + |y_1 - y_2|$? 

Note that the Manhattan distance between two points corresponds to the sum of the 
horizontal and the vertical distances between them. 

The special case we consider is called MANHATTAN PATH-TSP POLYNOMIAL CO- 
ORDINATES. It is as follows: 

INSTANCE: A set $P \subseteq Z^+ \times Z^+$ of $p$ points in the plane, a positive integer $B$. 
The $x$ and $y$ coordinates of each point have values at most $cp$, where $c$ is a constant. 

QUESTION: Is there a TSP-path of length $B$ or less in the associated complete graph on $p$ 
vertices, where the distance between the vertices corresponding to the points $(x_1, y_1)$ and 
$(x_2, y_2)$ is $|x_1 - x_2| + |y_1 - y_2|$? 

MANHATTAN PATH-TSP POLYNOMIAL COORDINATES is NP-complete. It follows 
from the NP-completeness proof of MANHATTAN PATH-TSP proposed by Papadimitriou 
(Papadimitriou 1977). The instance of MANHATTAN PATH-TSP he creates in the proof 
is actually an instance of MANHATTAN PATH-TSP POLYNOMIAL COORDINATES: it 
satisfies the polynomial-coordinate property. 

From a MANHATTAN PATH-TSP POLYNOMIAL COORDINATES instance, we will 
generate a DOP instance consisting of $|P|$ data words, one for each point in $P$. Each data 
word will have $\max_i\{x_i\} + \max_j\{y_j\} = X + Y$ bits, and will essentially be a concatenation 
of the unary codes of the $x$ and the $y$ coordinates of the corresponding point. For example, 
if $x_1 = 6$, $y_1 = 4$, and $X = \max_i\{x_i\} = 8$, $Y = \max_i\{y_i\} = 5$, the data word $w_1$ will 
be the concatenation of 00111111 (which is the unary representation of 6; two zeros to 
the left are added to make all data words of equal length) and 01111 (which is the unary 
representation of 4 modulo the padded zeros), i.e., 0011111101111. It can be easily seen 
that the Manhattan distance between two points in the set $P$ is the same as the Hamming 
distance between the corresponding data words. Hence, a TSP-path of length $B$ or less 
exists if and only if there exists a permutation with associated total number of transitions 
at most $B$ (thus, the DOP parameters are $n = |P|$, $k = X + Y$, $L = B$). Note that this 
is a polynomial-time transformation. In particular, note that since for any point $(x, y)$ in 
$P$, $x, y \le cp$, we obtain $X, Y \le cp$. Thus, the length of each data word is at most $2cp$, a 
polynomial in the size of the MANHATTAN PATH-TSP POLYNOMIAL COORDINATES 
instance. This completes the proof of NP-completeness of DOP. □ 
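
The encoding used in the proof is easy to check on small inputs. In the sketch below the point (6, 4) and the resulting word are from the text; the other two points are hypothetical and chosen so that $X = 8$ and $Y = 5$ as in the example. Function names are ours.

```python
from itertools import combinations


def unary(value, width):
    """Unary code of `value`, left-padded with zeros to `width` bits."""
    return "0" * (width - value) + "1" * value


def encode_point(x, y, X, Y):
    """Data word for point (x, y): unary(x) followed by unary(y)."""
    return unary(x, X) + unary(y, Y)


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


points = [(6, 4), (2, 5), (8, 0)]   # (6, 4) is the point used in the text
X = max(x for x, _ in points)       # 8
Y = max(y for _, y in points)       # 5
words = [encode_point(x, y, X, Y) for x, y in points]
print(words[0])                     # 0011111101111

# Manhattan distance between points equals Hamming distance between words.
for i, j in combinations(range(len(points)), 2):
    assert manhattan(points[i], points[j]) == hamming(words[i], words[j])
```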



5 APPROXIMATION ALGORITHMS 

As shown in the last section, the data ordering problem is intractable. However, we adapt 
well-known approximation algorithms to provide solutions with guarantees of closeness to 
the optimum. 

The key to guaranteeing the bounds is that the Hamming distance $d$ satisfies the triangle 
inequality, i.e., if $w_1, w_2, w_3$ are three data words, then $d(w_1, w_2) + d(w_2, w_3) \ge d(w_1, w_3)$. 
We make use of this in the sequel. 

The approximation algorithms we propose are adaptations of the well-known algorithms 
proposed for finding a minimum-weight TSP tour (or minimum-weight Hamiltonian tour) 
when the weights satisfy the triangle inequality (Nemhauser & Wolsey 1988). As pointed out 
in Section 4, DOP can be formulated in terms of the minimum-weight Hamiltonian path. 
Thus, we have to suitably transform minimum-weight tour algorithms to minimum-weight 
path algorithms. 



*The original problem is stated with $P \subseteq Z \times Z$, but there is no loss of generality in 
replacing $Z$ with $Z^+$. $Z^+$ denotes the set $\{0, 1, 2, \ldots\}$. 

5.1 Double Spanning Tree (DST) Heuristic 



1. Construct a complete graph $G = (V, E)$, where each vertex $v_i$ corresponds to the data 
word $w_i$, and the weight of the edge $(v_i, v_j)$ is $d(w_i, w_j)$. 

2. Find a minimum-weight spanning tree $T$ in $G$. 

3. Apply Procedure 2 to find a Hamiltonian tour HT. This procedure first duplicates each 
edge of T, finds an Eulerean cycle U using Procedure 1, and then constructs HT from 
U. 

4. Delete the longest edge e in HT to get a Hamiltonian path HP. The sequence of vertices 
in this path corresponds to the desired permutation. In fact, there are two permutations 
depending on the starting point for the sequence, and both work. 
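
The four steps above can be sketched in Python as follows. Two substitutions are made for brevity and should be read as our assumptions rather than the paper's implementation: the spanning tree is built with Prim's algorithm instead of the edge-sorting greedy method of Section 3, and a depth-first preorder of the tree stands in for duplicating the edges, running Procedure 1 and deleting node repetitions (the preorder is the order of first visits along one Eulerean cycle of the doubled tree).

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def path_cost(order):
    return sum(hamming(a, b) for a, b in zip(order, order[1:]))


def dst_order(words):
    """Double-spanning-tree ordering of a list of equal-length bit strings."""
    n = len(words)
    if n <= 2:
        return list(words)
    # Step 1: complete graph with Hamming distances as edge weights.
    dist = [[hamming(a, b) for b in words] for a in words]

    # Step 2: minimum-weight spanning tree (Prim), stored as child lists.
    in_tree = [False] * n
    best = [float("inf")] * n
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[0] = 0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], parent[v] = dist[u][v], u

    # Step 3: Hamiltonian tour as the preorder of the tree.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))

    # Step 4: delete the longest edge of the closed tour to obtain a path.
    cut = max(range(n), key=lambda i: dist[tour[i]][tour[(i + 1) % n]])
    path = tour[cut + 1:] + tour[:cut + 1]
    return [words[i] for i in path]


words = ["11001", "10010", "11011", "00010"]   # hypothetical instance
order = dst_order(words)
print(order, path_cost(order))  # ['11001', '11011', '10010', '00010'] 4
```

On this small instance the heuristic happens to reach the optimum of 4 transitions, against 8 for the original order; in general only the approximation guarantee applies.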

Let $d(S)$ denote the total weight of the tour (path, tree) $S$, as defined in Section 3. 



Proposition 2 Let HP* be the optimum (i.e., minimum-weight) Hamiltonian path in G. 
Then, d(HP) < where n is the number of the vertices in G, and HP is 

the Hamiltonian path produced by the double spanning tree heuristic. 



Thus, the permutation generated by the double spanning tree heuristic comes within a 
factor of of the optimum. 



5.2 Spanning Tree/Minimum Matching (ST-MM) Heuristic 

This heuristic also first constructs a minimum-weight spanning tree. However, to construct
an Eulerian cycle, instead of duplicating each edge of the tree, it tries to add fewer edges.
It works as follows.

1. Construct a complete graph G = (V, E), where the vertices $v_i$ correspond to the data
words $w_i$, and the weight of the edge $(v_i, v_j)$ is $d(w_i, w_j)$.

2. Find a minimum-weight spanning tree T in G. 

3. Identify the set O of vertices in the tree T with odd degrees. There must be an even 
number of such vertices. 

4. Find a minimum-weight perfect matching M on G(O) = (O, E(O)), the subgraph of G
induced on O.* Such a matching exists, since |O| is even and G(O) is a complete graph.

5. Consider the graph T ∪ M. Since each edge of M connects two different odd-degree
vertices of T, T ∪ M is a connected graph in which all vertices have even degrees. This
implies that there exists an Eulerian cycle U in T ∪ M. Construct U using Procedure
1. From U, construct a Hamiltonian tour HT by deleting the node repetitions, exactly
as in Procedure 2. Delete the longest edge e in this tour to get a Hamiltonian path HP.
This path corresponds to the desired permutation.



Proposition 3 Let HP* be the optimum Hamiltonian path in G. Then
d(HP) < [...], where k is the number of bits in a data word, and HP is
the Hamiltonian path produced by the spanning tree/minimum matching heuristic.



*G(O) is the subgraph of G on the vertices O and has exactly those edges of G that have
both their end-points in O.

5.3 Greedy Heuristic

The DST and the ST-MM heuristics first form a minimum-weight spanning tree, then an
Eulerian cycle, and finally a Hamiltonian path. In contrast, the greedy heuristic is very
simple. It always works with a path P and at every step extends it towards a Hamiltonian
path.

It works as follows. Like the previous heuristics, it also constructs the complete graph
G. Then it selects the minimum-weight edge, say (i, j), of G. The path P after this step
is (i, j). Next, the minimum-weight edge incident on an end-point of P (i.e., on i or j) is
selected. Let it be (i, k). Thus P becomes (k, i, j). In general, at each step the heuristic greedily
selects the minimum-weight edge incident on an end-point of P that is not incident on an
internal vertex of P. The heuristic terminates when all the vertices of G are in P, in which
case a Hamiltonian path has been constructed. As the experimental results in Section 6 will
show, the greedy technique performs the best.
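The greedy extension step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, ties are broken by the smallest vertex index (an assumption), and only vertices not yet on the path are considered as new end-points.

```python
def hamming(a, b):
    """Number of bit positions in which two data words differ."""
    return bin(a ^ b).count("1")

def greedy_order(words):
    """Grow a path from the cheapest edge, always extending at an end-point."""
    n = len(words)
    if n < 2:
        return list(range(n))
    # Start with the minimum-weight edge (i, j) of the complete graph.
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda e: hamming(words[e[0]], words[e[1]]))
    path = [i, j]
    unused = set(range(n)) - {i, j}
    while unused:
        # Candidate edges leave either end-point and reach an unused vertex.
        candidates = [(hamming(words[path[0]], words[u]), u, True) for u in unused]
        candidates += [(hamming(words[path[-1]], words[u]), u, False) for u in unused]
        _, u, at_front = min(candidates)
        if at_front:
            path.insert(0, u)
        else:
            path.append(u)
        unused.remove(u)
    return path

def transitions(words, order):
    """Total number of bit transitions for the given transmission order."""
    return sum(hamming(words[u], words[v]) for u, v in zip(order, order[1:]))

WORDS = [0b0000, 0b1111, 0b0001, 0b1110]
order = greedy_order(WORDS)
print(order, transitions(WORDS, order), transitions(WORDS, range(len(WORDS))))
```

Each step scans the unused vertices from both end-points, so the sketch runs in O(n^2) distance evaluations overall.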

We note that other algorithms used for TSP can also be applied. Some of these are
nearest neighbor, nearest insertion, m-exchange, etc. (Nemhauser & Wolsey 1988). However,
no performance bounds are known for these heuristics.



6 EXPERIMENTAL RESULTS 



The objective of our experiments is to evaluate the effectiveness of the proposed resequencing
algorithms in minimizing the switching activity. The experimental set-up is as follows.
For a set of n k-bit data words chosen from a uniform distribution, we count the total
number of transitions using the following resequencing schemes:

• random: Words are transmitted in the same order as they were generated. Since the 
words were generated randomly, we call this scheme random. 

• greedy: Words are transmitted in an order determined by the greedy heuristic of Section 
5.3. 

• DST: Words are transmitted in an order determined by the double spanning tree heuristic
of Section 5.1.

• ST-MM: Words are transmitted in an order determined by the spanning tree/minimum
matching heuristic of Section 5.2.

For each n, we consider different values of k. 

The results are shown in Table 1. Each entry in the columns corresponding to the four 
schemes is the total number of transitions for transmitting the entire set of n words. The 
greedy scheme performs the best. On average, it yields 36% fewer transitions than random. 
Average power being directly proportional to the expected number of transitions, 36% fewer 
transitions translates to a power saving of 36% on the busses. Since bus-power can be as 
high as 40% of the power consumed on the chip, we can expect about 15% reduction in the 
total power. 

On average, DST and ST-MM are respectively 29% and 33% better than random. Although
DST and ST-MM have a predictable worst-case performance in theory, they are
not as good as greedy in these experiments. The reason is as follows. Both DST and ST-MM
construct an Eulerian cycle U before constructing the Hamiltonian path. The final result is
sensitive to the choice of U. In our implementation, we pick one amongst several possible
Eulerian cycles of the graph. In the future, we plan to characterize the sensitivity of the
final Hamiltonian path with respect to the starting Eulerian cycle.
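The averages quoted above can be recomputed from the entries of Table 1. The sketch below assumes that "on average" means the mean of the per-row savings relative to random; it also prints the saving computed from the column totals for comparison. The function names are illustrative only.

```python
# (random, greedy, DST, ST-MM) transition counts, one tuple per row of Table 1.
ROWS = [
    (22, 10, 11, 10), (46, 23, 29, 23), (87, 72, 74, 78), (190, 149, 157, 157),
    (99, 21, 26, 25), (192, 87, 106, 94), (390, 225, 248, 243),
    (784, 567, 608, 591), (1171, 893, 955, 923), (371, 149, 191, 162),
    (780, 426, 484, 455), (1555, 1075, 1192, 1131), (3135, 2439, 2620, 2502),
    (3919, 3163, 3351, 3233), (496, 174, 220, 190), (1006, 515, 618, 558),
    (2001, 1326, 1450, 1373), (4019, 3049, 3335, 3185), (4969, 3894, 4162, 3981),
    (7482, 6139, 6437, 6348), (1998, 957, 1221, 1051), (4077, 2490, 2878, 2646),
    (7916, 5767, 6291, 6053), (9906, 7520, 8289, 7847),
    (15001, 12027, 12955, 12407), (19996, 16463, 17477, 16881),
]

def mean_saving(column):
    """Average over the rows of (1 - scheme / random)."""
    return sum(1 - row[column] / row[0] for row in ROWS) / len(ROWS)

def pooled_saving(column):
    """Saving computed from the column totals instead of row by row."""
    return 1 - sum(row[column] for row in ROWS) / sum(row[0] for row in ROWS)

for name, column in (("greedy", 1), ("DST", 2), ("ST-MM", 3)):
    print(f"{name:6s} mean per-row saving {mean_saving(column):.1%}, "
          f"saving on totals {pooled_saving(column):.1%}")
```

Under this reading the per-row means come out at roughly 36%, 29% and 33%. The saving on the column totals is lower (about 24% for greedy) because the rows with long words, where resequencing helps less, dominate the totals.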

One nice implication of the simple greedy scheme being the best is that for applications 
where the data words are not known a priori (such as in the cache write-back application 
and inputs $x_i$ of the multiplier in the scheduling application), the hardware will not be very
complex. Also, the power overhead will be small, since the capacitance driven by this extra 
hardware is negligible as compared to the bus capacitance. 



Table 1 Effect of Data Resequencing on Number of Transitions

# words (n)   word-length (k)   random   greedy     DST   ST-MM
     10               5             22       10      11      10
     10              10             46       23      29      23
     10              20             87       72      74      78
     10              40            190      149     157     157
     40           [...]             99       21      26      25
     40              10            192       87     106      94
     40              20            390      225     248     243
     40              40            784      567     608     591
     40              60           1171      893     955     923
     80              10            371      149     191     162
     80              20            780      426     484     455
     80              40           1555     1075    1192    1131
     80              80           3135     2439    2620    2502
     80             100           3919     3163    3351    3233
    100              10            496      174     220     190
    100              20           1006      515     618     558
    100              40           2001     1326    1450    1373
    100              80           4019     3049    3335    3185
    100             100           4969     3894    4162    3981
    100             150           7482     6139    6437    6348
    200              20           1998      957    1221    1051
    200              40           4077     2490    2878    2646
    200              80           7916     5767    6291    6053
    200             100           9906     7520    8289    7847
    200             150          15001    12027   12955   12407
    200             200          19996    16463   17477   16881
  total                          91608    69620   75385   72147

# words      number of randomly generated data words
word-length  length of each data word
random       total # of transitions using a random ordering
greedy       total # of transitions using a greedy ordering
DST          total # of transitions using ordering by DST heuristic
ST-MM        total # of transitions using ordering by ST-MM heuristic
total        sum of # of transitions over all the examples

7 EXTENSIONS AND CONCLUSIONS

We consider some applications where the previous heuristics can be applied, with minor
modifications.

In the cache write-back problem of Section 1, the transitions on both the address and
the data busses have to be collectively minimized. This is simply done by concatenating
the address and the corresponding data, treating them together as a super word, and
applying the previous algorithms on the super words.

Next consider the problem of scheduling for high-level synthesis, as described in Section
1. For the FIR filter example, note that the problem of scheduling the multiplications is not
that of finding a minimum-weight TSP-path, but rather that of finding a minimum-weight
TSP-tour. This is due to the cyclic nature of scheduling: after all the multiplications have
been performed in a sample period, they are performed in the same order in the next sample
period. Thus we have to include the number of transitions between the $c_j$ corresponding
to the last multiplication in the current sample period and the $c_j$ corresponding to the
first multiplication in the next sample period (which is the same as the first multiplication
in the current sample period). For the TSP-tour of minimum weight, the approximation
algorithms described in (Nemhauser & Wolsey 1988) can be used directly. These are the
same as those in Section 5, except that the final result is the tour HT.

Some other simple extensions are transmitting a set of records where each record is a set 
of bytes (instead of being a single byte), the bus having a fixed initial and/or final state, 
etc. 

We are currently studying the utility of resequencing in situations where the data is not
known a priori, for instance, during cache write-back. One way to solve the problem is
to invoke the algorithm, implemented in hardware, while the system is actually running.
However, it is hard to estimate in advance the associated area, timing, and power penalties.
On the other hand, in some applications, such as the CDFG data binding/scheduling
problem (Dasgupta & Karri 1995), it may be possible to estimate the switching activity by
using the input signal probabilities. In such cases, the problem becomes identical to that in
which the data is known.

We have left unaddressed applications where the data cannot be re-ordered: it has to be 
transmitted in the same order it is given. Here we may be able to re-encode the data at 
the transmitter and suitably decode it at the receiver to minimize the transitions over the 
channel. We are currently working on this problem.

Abstract

Adder architectures are presented here by a unified formalism, and analysed from
the delay, complexity and power consumption points of view. An analytical model
for the power consumption is derived, assuming that it is proportional to the
transition density [DHNT95]. The model is subsequently validated by simulation
using a signal transition probabilities propagation tool [Cra89]. Finally, glitches
are taken into account when transitions at the input of a cell are separated by one or
more cell delays. A redundant to total power ratio is also derived.

Keywords 

Adder, BDD, glitch threshold, low power, spurious transition, switching activity 



1 INTRODUCTION 

Addition is the most frequently used arithmetic primitive, involved not only in 
simple addition but also in more complex operations like multiplication and 
division. The present study covers the linear ripple carry adder and different 
architectures of carry select and carry lookahead adders. 

Designing low-power high-speed circuits requires a combination of techniques at
four levels: technology, circuitry, architectures and algorithms [BCS92]. This
work concentrates on the architecture level and considers a CMOS static
technology.

The paper is organised as follows: the A operator introduced by Brent and Kung
[BrKu82] is first recalled. Then it is used to describe several well-known adder
architectures by a unified formalism. An analytical power consumption model is
derived first for the ripple carry adder, and extended to other architectures. The
notion of Glitch Threshold is then introduced and validated by HSPICE
simulations, providing a glitch filtering model usable by the power evaluation
software [Cra89]. Finally, concluding remarks and a brief presentation of future
work are given.



2 THE A OPERATOR 

Let us consider an adder that computes S = A + B, where $A = \sum_{i=0}^{n-1} a_i 2^i$ and
$B = \sum_{i=0}^{n-1} b_i 2^i$. At every position i, the next carry $c_{i+1}$ is either generated, i.e.
$c_{i+1} = 1$, propagated, i.e. $c_{i+1} = c_i$, or killed, i.e. $c_{i+1} = 0$, according to the values of
the digits $a_i$ and $b_i$. So three signals can be defined, one for each case: $g_i = a_i \wedge b_i$,
$p_i = a_i \oplus b_i$, and $k_i = \overline{a_i \vee b_i}$.

Then let us note $P_{i,j}$ the group propagate and $G_{i,j}$ the group generate, with
$n-1 \ge i \ge j \ge 0$. $P_{i,j}$ means that the carry propagates from position j up to position
i, that is, $c_{i+1}$ is equal to $c_j$: $P_{i,j} = \bigwedge_{m=j}^{i} p_m$. $G_{i,j}$ means that a carry is
generated somewhere between j and i and propagated from this location up to
position i, and yields $c_{i+1} = 1$: $G_{i,j} = g_i \vee \bigvee_{m=j}^{i-1} (P_{i,m+1} \wedge g_m)$.

Clearly, one has $P_{i,i} = p_i = a_i \oplus b_i$, $G_{i,i} = g_i = a_i \wedge b_i$, $P_{i,j} \wedge G_{i,j} = 0$ and
$c_{i+1} = G_{i,0}$. For any k such that $n-1 \ge i \ge k > j \ge 0$, the pair of bits $(P_{i,j}, G_{i,j})$
can be computed from $(P_{i,k}, G_{i,k})$ and $(P_{k-1,j}, G_{k-1,j})$ in the following way:

$(P_{i,j}, G_{i,j}) = (P_{i,k} \wedge P_{k-1,j},\; G_{i,k} \vee (P_{i,k} \wedge G_{k-1,j}))$    (1)

The operator A is defined such that:

$(P_{i,j}, G_{i,j}) = (P_{i,k}, G_{i,k}) \; A \; (P_{k-1,j}, G_{k-1,j})$.    (2)

In the subsequent figures an icon is used for the 4 bit input, 2 bit output
A-cell.

It is easy to prove that A is associative, non-commutative and idempotent.
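The three properties can be checked exhaustively, since a (P, G) pair only takes four values. The sketch below encodes equation (1) as a Python function and also confirms, for 4-bit operands, that rippling the operator from position 0 yields the carries of an ordinary addition. Function names are illustrative, and a carry-in of 0 is assumed.

```python
from itertools import product

def cell(upper, lower):
    """Combine the (P, G) pair of an upper group with that of the group below it."""
    p_up, g_up = upper
    p_lo, g_lo = lower
    return (p_up & p_lo, g_up | (p_up & g_lo))

PAIRS = list(product((0, 1), repeat=2))  # every possible (P, G) pair

def is_associative():
    return all(cell(cell(x, y), z) == cell(x, cell(y, z))
               for x, y, z in product(PAIRS, repeat=3))

def is_idempotent():
    return all(cell(x, x) == x for x in PAIRS)

def is_commutative():
    return all(cell(x, y) == cell(y, x) for x, y in product(PAIRS, repeat=2))

def carries(a, b, width):
    """Carries c_1 .. c_width of a + b, obtained as the G output of group (i, 0)."""
    out, group = [], None
    for i in range(width):
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        pg = (bit_a ^ bit_b, bit_a & bit_b)            # (p_i, g_i)
        group = pg if group is None else cell(pg, group)
        out.append(group[1])
    return out

def matches_addition(width=4):
    """Compare the prefix carries with those of integer addition, exhaustively."""
    for a, b in product(range(2 ** width), repeat=2):
        expected = [((a % 2 ** (i + 1)) + (b % 2 ** (i + 1))) >> (i + 1)
                    for i in range(width)]
        if carries(a, b, width) != expected:
            return False
    return True

print(is_associative(), is_idempotent(), is_commutative(), matches_addition())
```

Associativity is what allows the different adder architectures to regroup the same computation into trees of different shapes.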

Any $(P_{i,j}, G_{i,j})$ requires (i-j-1) A-cells to be computed from the
adder inputs. Intermediate results from the A-cells may be reused, thus reducing
the total number of A-cells, but increasing the fan-out of some of them [Zim96].

3. COST AND DELAY MODELS: WORST CASE

The cost of an n-bit adder consists of a linear cost to compute the $g_i$ and $p_i$ from $a_i$
and $b_i$, and the $s_i$ from $p_i$ and $G_{i-1,0}$, plus a cost varying according to the
implementation chosen to get the $G_{i-1,0}$. This cost is given roughly by the
number of A-cells, which may range from (n-1) up to $(n \log_2 n)$ for regular adders or
even go up to $\frac{1}{2}(n-1)^2$ for special purpose adders [Zim96].

Note that the $P_{i-1,0}$ are never used, so the A-cells at the bottom of the following
figures produce only the $G_{i-1,0}$ output, and since the $P_{i,j}$ are only useful for the
right input of those cells, all the n-1 A-cells at the bottom of the figures are
simplified. This saving is accounted for in the fixed cost.

The adder delay is the sum of the delays of the A-cells along the critical path plus
a fixed delay to get the $g_i$ and $p_i$ and finally the $s_i$. In the following, the delay of an
A-cell is used as the delay unit.

3.1. Some Adder Architectures 

Let us examine now some well-known architectures [GBB94], their delay (number 
of A-cells along the critical path), and their cost (the total number of A-cells). 

3.1.1 Ripple Carry Adder 

The delay and the cost of the ripple carry adder (figure 1) are both O(n-1). It is inefficient
but easily constructed by mere abutment of A-cells.

Figure 1: A 32-bit carry ripple adder

3.1.2 Two Level Carry Select Adder (2-CSA)

The two level carry select adder (figure 2), also named conditional sum adder or
carry increment adder, is based on the previous one truncated into blocks of varying
sizes. Its cost is $O(2n)$ and its delay $O(\lceil \sqrt{2n} \rceil)$; more precisely, with k A-cells along
the critical path, an adder can accommodate up to $1 + \sum i = \frac{1}{2} k(k+1)$ bits.

[Figure 2: two level carry select adder]

3.1.3 Brent and Kung Adder

The Brent and Kung adder [BrKu82] is based on binary A-cell trees. The cost is
$O(2n)$, the delay $O(2\lceil \log_2(n) \rceil - 2)$. One binary tree outputs the $G_{i,0}$ for all i of the
form $2^j - 1$, then another tree gives the remaining $G_{i,0}$.

3.1.4 Sklansky Adder 

The Sklansky adder [Skla60] has proved to be the fastest architecture. Its cost is
$O(\lceil \frac{n}{2} \log_2 n \rceil)$, and its delay $O(\lceil \log_2(n) \rceil)$. The main drawback is that the fan-out
grows exponentially from the inputs to the outputs along the critical path and
consequently the transistors must be sized accordingly.

3.1.5 Kogge & Stone and Han & Carlson Adders 

The most significant bit of a Brent and Kung adder, as well as of a Sklansky adder, is
obtained by a perfectly balanced binary tree in time $\log_2(n)$. If the tree for the most
significant position is just copied for all other positions, the Kogge and Stone
adder [KoSt73] is obtained. The fan-out is reduced to just two, at the expense of a
larger number of A-cells, which becomes $O(n(\log_2(n) - 1) + 1)$ cells. As for the
Sklansky adder, the delay is $O(\lceil \log_2 n \rceil)$.

In order to reduce the number of cells of the Kogge and Stone adder, Han and
Carlson [HaCa87] have proposed to compute only the odd positions, and then to
add a layer to compute the even positions from the odd ones. The delay is slightly
increased to $O(\lceil \log_2(n) \rceil + 1)$, while the complexity becomes $O(\frac{n}{2} \lceil \log_2(n) \rceil + 1)$.



3.2. Comparison 

Table 1 [TVG95]

Adder      # of A-cells              Delay               Max. fan-out        Useful Activity
Ripple     n - 1                     n - 1               2                   n/2
2-CSA      ⌈2n - √(2n)⌉              ⌈√(2n)⌉             ⌈√(2n)⌉             ⌈(2n - √(2n))/2⌉
3-CSA      5/2 n - 3 log2(n/2)       [...]               [...]               N.A.
B&K        ⌈2n - log2(n)⌉            ⌈2 log2(n) - 2⌉     ⌈2 log2(n) - 2⌉     N.A.
Sklansky   ⌈n/2 log2(n)⌉             ⌈log2(n)⌉           n/2                 ⌈n/4 log2(n)⌉
K&S        ⌈n (log2(n) - 1) + 1⌉     ⌈log2(n)⌉           2                   ≈ ⌈n/2 log2(n)⌉
H&C        ⌈n/2 log2(n) + 1⌉         ⌈log2(n)⌉ + 1       2                   ≈ ⌈n/4 log2(n)⌉
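To get a feeling for the orders of magnitude, the cost and delay expressions of Table 1 can be evaluated for a given word length. The sketch below takes the expressions at face value as they read in the table (the 2-CSA entries in particular are an assumed reading) and leaves out the 3-CSA row, whose delay is not available. The function name is illustrative.

```python
from math import ceil, log2, sqrt

def cost_and_delay(n):
    """(number of A-cells, delay in A-cell units) per architecture, for n bits."""
    lg = log2(n)
    return {
        "Ripple":   (n - 1, n - 1),
        "2-CSA":    (ceil(2 * n - sqrt(2 * n)), ceil(sqrt(2 * n))),
        "B&K":      (ceil(2 * n - lg), ceil(2 * lg - 2)),
        "Sklansky": (ceil(n / 2 * lg), ceil(lg)),
        "K&S":      (ceil(n * (lg - 1) + 1), ceil(lg)),
        "H&C":      (ceil(n / 2 * lg + 1), ceil(lg) + 1),
    }

for name, (cells, delay) in cost_and_delay(32).items():
    print(f"{name:9s} cells = {cells:4d}   delay = {delay:3d}")
```

For 32 bits the ripple adder needs the fewest cells but has the longest critical path, while the Kogge and Stone adder sits at the opposite corner of the trade-off.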



4 . ACTIVITY MODEL FOR THE RCA 

In this part of the paper, a model for the activity of a Ripple Carry Adder (RCA) is
derived without taking into account the attenuation of the spurious transitions. In
the ripple carry adder, when all the inputs are applied at once, the activity is mainly
due to the propagation of the carry through a chain of $p_i = 1$. Let us call T(n,k) the
number of different chains of k consecutive "1"s in a binary word of length n. It is
obvious that T(n,0) = 0 (no "zero bit" chain), T(n,n) = 1 (i.e. 111...11) and
T(n,n-1) = 2 (i.e. 2 possibilities: 011...111 or 11...1110).

Let us now compute the general term T(n,k) for 0 < k < n. Since the word
extremities as well as the bit value 0 act as chain separators, we distinguish two
cases. When the chain touches one of the two extremities of the n-bit word, there
are $2^{n-(k+1)}$ different values for the n - (k+1) bits outside the chain.

    11...10 | 011...01
      k+1   | n-(k+1)

There are n - (k+1) possibilities for the chain to be in the middle of the word, and
for each position there are $2^{n-(k+2)}$ values of the n - (k+2) remaining bits.

    01...10 | 011...01
      k+2   | n-(k+2)

Thus $T(n,k) = 2^{n-k} + (n-k-1)\, 2^{n-k-2}$ for 0 < k < n, T(n,0) = 0 and T(n,n) = 1.
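The counting argument can be checked by brute force for small n. The sketch below assumes that a "chain" is a maximal run of consecutive 1s and that the count follows the two cases above (two end placements with $2^{n-k-1}$ completions each, and n-k-1 middle placements with $2^{n-k-2}$ completions each), which gives $T(n,k) = (n-k+3)\,2^{n-k-2}$. Function names are illustrative.

```python
from itertools import groupby

def count_chains(n):
    """totals[k] = number of maximal runs of exactly k ones over all n-bit words."""
    totals = [0] * (n + 1)
    for word in range(2 ** n):
        bits = format(word, f"0{n}b")
        for bit, run in groupby(bits):
            if bit == "1":
                totals[len(list(run))] += 1
    return totals

def closed_form(n, k):
    """(n - k + 3) * 2**(n - k - 2), written so that it stays an integer."""
    return (n - k + 3) * 2 ** (n - k) // 4

n = 8
counted = count_chains(n)
for k in range(1, n):
    print(k, counted[k], closed_form(n, k))
```

The enumeration costs $O(n\,2^n)$, so it is only meant as a sanity check for word lengths up to about 20 bits.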



4.1. Activity of the RCA 

In the case of the RCA, none of the outputs is obtained from a balanced binary 
tree, and thus, the activity window of any output is equal to its logical depth. This 
is not true for other architectures where the outputs are obtained by balanced binary 
trees like the Kogge and Stone. 

The activity caused by a carry propagation over k positions is proportional to
$k^2/2$ [MoPa96]. Thus the average activity is

$A(n) = \frac{1}{2^n} \sum_{k=0}^{n} \frac{k^2}{2}\, T(n,k)$.    (3)

Let us recall some useful identities [Kre93]:

$\sum_{i=0}^{n} \frac{i}{2^i} = 2 - \frac{n+2}{2^n}, \quad \sum_{i=0}^{n} \frac{i^2}{2^i} = 6 - \frac{n^2+4n+6}{2^n}, \quad \sum_{i=0}^{n} \frac{i^3}{2^i} = 26 - \frac{n^3+6n^2+18n+26}{2^n}$    (4)

since they allow to simplify the expression of A:

$A = \frac{3n-4}{4} - \frac{3n^2}{2^{n+3}} \;\xrightarrow[n \to \infty]{}\; \frac{3n-4}{4}$    (5)

With these identities, one can also easily verify that:

$\frac{1}{2^n} \sum_{k=0}^{n} k\, T(n,k) = \frac{n}{2} - \frac{3n}{2^{n+2}} \;\xrightarrow[n \to \infty]{}\; \frac{n}{2}$    (6)

which is the known average delay of the ripple carry adder.

In the following, the higher order terms are neglected, i.e. it is assumed that
$A = \frac{3n-4}{4}$. Table 2 shows the relative error for 8, 16, 32 and 64 bits.
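The leading terms can be recovered in a few lines. The derivation below is a sketch that assumes $T(n,k) = (n-k+3)\,2^{n-k-2}$ and lets the sums run to infinity, so that only the limits 2, 6 and 26 of the three identities are used; the exponentially small corrections are ignored.

```latex
\begin{align*}
A(n) &\approx \sum_{k \ge 0} \frac{k^2}{2}\,(n-k+3)\,2^{-k-2}
      = \frac{1}{8}\Bigl[(n+3)\sum_{k \ge 0} \frac{k^2}{2^k} - \sum_{k \ge 0} \frac{k^3}{2^k}\Bigr]
      = \frac{6(n+3) - 26}{8} = \frac{3n-4}{4}, \\
\frac{1}{2^n}\sum_{k} k\,T(n,k) &\approx \frac{1}{4}\Bigl[(n+3)\sum_{k \ge 0} \frac{k}{2^k} - \sum_{k \ge 0} \frac{k^2}{2^k}\Bigr]
      = \frac{2(n+3) - 6}{4} = \frac{n}{2}.
\end{align*}
```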



Table 2

# of bits   A (%)       Delay (%)   η (%)
8           9.3750      2.34        8.2759
16          0.15        0.02        0.04
32          8.94e-06    5.59e-07    4.26e-06
64          8.33e-15    2.60e-16    2.58e-06



Due to the equiprobability of the output vectors, the average number of useful
transitions in a RCA is equal to half the number of cells. The total activity A is
split in two parts: $A = A_{useful} + A_{redundant}$. Thus, the ratio $\eta$ of redundant over
total activity is:

$\eta = \frac{A_{redundant}}{A} = \frac{A - A_{useful}}{A} = \frac{n-4}{3n-4}$    (7)

For large values of n, $\eta = 1/3$. This result is consistent with the BDD
simulations using a unit delay and with [LMJ95].
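The ratio is easy to tabulate. The short sketch below assumes the approximate total activity (3n-4)/4 and a useful activity of n/2, as in the text, and uses exact fractions so that the convergence towards 1/3 is visible without rounding noise. The function name is illustrative.

```python
from fractions import Fraction

def redundant_ratio(n):
    """Redundant over total activity of an n-bit RCA under the approximate model."""
    total = Fraction(3 * n - 4, 4)
    useful = Fraction(n, 2)
    return (total - useful) / total

for n in (8, 16, 32, 64, 1024):
    ratio = redundant_ratio(n)
    print(f"n = {n:5d}   eta = {ratio} = {float(ratio):.4f}")
```

For small adders the ratio is well below the asymptotic value, so the 1/3 figure should be read as a large-n limit.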

By adopting this approach, it is also possible to determine the activity at a given
time $t_i$ (Figure 3).

Figure 3: Activity at time $t_i$ (ripple chain of length k).

The activity at $t_1$ is given by: $A(t_1) = \frac{1}{2^n} \sum_{k=1}^{n} k\, T(n,k)$. The ripple chains
of length k that exist at $t_2$ are those of length k+1 at $t_1$, thus:
$A(t_2) = \frac{1}{2^n} \sum_{k=1}^{n-1} k\, T(n,k+1)$. The sum goes to n-1 because in a word of length
n, one cannot have a ripple carry chain of length greater than n.

More generally, the activity at time $t_i$ is given by:

$A(t_i) = \frac{1}{2^n} \sum_{k} k\, T(n,k+i) = 2^{-i}\,([\ldots])$    (8)

5. EXTENSION OF THE MODEL TO OTHER ARCHITECTURES



The previous model is extended to the adders that can be obtained by an association
of ripple carry chains, like the carry select adder or the Kogge and Stone adder for
example.

In the computation of $\eta$, the useful activity $A_{useful}$ is assumed to be half the
number of A-cells since "0" and "1" are equiprobable (this is consistent with the
BDD simulations).

5.1. Two Level Carry Select Adder 



The 2-CSA adder is a RCA truncated into blocks. For n bits, the length of these
blocks varies from 1 to $\sqrt{2n} - 1$. Thus an n-bit 2-CSA can be viewed as $\sqrt{2n}$
RCAs of length varying from 1 to $\sqrt{2n} - 1$ (first level) plus a row of cells that
form the second level (figure 2).

5.1.1 First Level

The total activity of the first level of the 2-CSA is given by the sum of the
activities of the ripple carry chains, as they are independent from each other.

$A_{1^{st}\,level} = \sum_{i=1}^{\sqrt{2n}} \frac{3i-4}{4} = [\ldots]$    (9)




5.1.2 Second Level 

The second level of the 2-CSA is approximated here by a ripple carry chain of
length $\sqrt{2n}$, in which a cell at position i is duplicated i times. This approach
neglects the activity generated at the second level by the ripple of the outputs at the
first level. The activity of such a ripple carry chain can be deduced from the activity
of the RCA by assuming that the capacitance of the k-th cell is k instead of 1.

$A_{2^{nd}\,level} = [\ldots]$    (10)



5.1.3 Total Activity of the 2-CSA

The total activity of the 2-CSA is the sum of the activities of the first and second
levels:

$A_{Total} = A_{1^{st}\,level} + A_{2^{nd}\,level} = \frac{3\sqrt{2n} + 4n\sqrt{2n} - 6n - 57}{8}$    (11)

Assuming that the useful activity is given by half the number of cells, the
redundant to total activity ratio can be computed:

$\eta = \frac{A - A_{useful}}{A} = \frac{7\sqrt{2n} + 4n\sqrt{2n} - 14n - 57}{3\sqrt{2n} + 4n\sqrt{2n} - 6n - 57}$    (12)

5.2. Kogge and Stone adder 

Each bit of the Kogge and Stone adder is obtained by a balanced binary tree, thus
the output of any cell can change only once during a clock cycle: there are no redundant
transitions.

The carry propagation is the result of a logical AND, thus its activity decreases
very rapidly with the depth (like $2^{-d}$ at depth d), but the transition probability of the carry
generation is almost constant (1/2). These considerations allow us to approximate
the activity of the Kogge and Stone adder by half the number of its cells:

$A_{K\&S} \approx \frac{n}{2} \log_2 n$, and $\eta = 0$    (13)



6. GLITCH THRESHOLD 



HSPICE simulations have been carried out on simple circuit examples (Figure 4) in
order to quantify the absorption or propagation of spurious transitions. The notion of
Glitch Threshold is introduced here in order to quantify when a glitch becomes
a spurious transition.

Figure 4: Spurious transition generation and propagation



When the transitions at the inputs of a gate are separated by a delay δ, a glitch is
generated at "Out" (figure 4). The amplitude of this glitch is proportional to δ
(figure 5). This glitch can be either absorbed or propagated, depending on its width
and on the delay of the following gate. As can be seen in figure 5, there is a
threshold for the glitch propagation from "Out" to "Out1".

HSPICE measurements have been carried out (ATMEL-ES2 ECPD07
technology) and plotted. The plot shows that there is a threshold delay under which
the glitch is absorbed, and above which the glitch grows into a spurious transition
(figure 5). We call this threshold delay $G_{th}$.
The variation of $G_{th}$ with respect to T (the buffers' delay) is linear (figure 6), and
the slope is approximately 1.89. This means that a glitch of width δ will
become a spurious transition only if δ > $G_{th}$ ≈ 1.89 T. The glitch threshold depends on
the loading capacitance of Out1. The threshold phenomenon is attenuated when
the capacitance is large.

Figure 5: Glitch Threshold

Figure 6: Variation of $G_{th}$ with T.

This characteristic behaviour of spurious transitions has been implemented in
a BDD simulation tool [Cra89], and in the following section, the above analytical
model is compared to the simulations for different adder architectures.
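The threshold rule can be written as a simple filter. The sketch below is not the implementation used in the BDD tool: it only encodes the linear relation reported above (slope of about 1.89) and assumes, for illustration, that the glitch width equals the separation of the input transitions and that every cell has the same delay T. All names are hypothetical.

```python
import math

SLOPE = 1.89  # slope of G_th versus the buffer delay T, as reported in the text

def glitch_threshold(buffer_delay):
    """Threshold width G_th for a following gate of delay T (linear model)."""
    return SLOPE * buffer_delay

def becomes_spurious_transition(width, buffer_delay):
    """True if a glitch of the given width is propagated instead of absorbed."""
    return width > glitch_threshold(buffer_delay)

def min_separation_in_cell_delays():
    """Smallest whole number k of cell delays such that k * T exceeds G_th."""
    return math.floor(SLOPE) + 1

T = 1.0  # arbitrary delay unit
for width in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(width, becomes_spurious_transition(width, T))
print("propagating separation:", min_separation_in_cell_delays(), "cell delays")
```

Under these assumptions, input transitions separated by a single cell delay produce a glitch that is absorbed, whereas a separation of two or more cell delays yields a spurious transition.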



7 . BDD SIMULATION AND RESULTS 



The tool used for implementation and experiment is ASYL+ [Cra89]. It provides a 
complete environment for macrogeneration and low-level synthesis. The size of the 
operands is sufficient to build a netlist for any kind of adder architecture. The 
mapping and the estimates are then performed with the user library, delay and 
dissipation model. 



Figure 7: Switching probabilities example



The power is dependent on the circuit structure as well as the circuit inputs: it is
said to be input pattern-dependent. To solve this problem, one can simulate the
circuit for a large number of inputs and then average the switching activity. On the
other hand, probabilities were introduced [Bur88] to perform the averaging before
running the analysis [Najm95] by estimating the number of transitions per clock
cycle. Using the Boolean network functionality and connectivity, these input
probabilities are propagated through the network. To apply statistical properties,
reconvergent fanouts and feedback have to be taken into account. A convenient way
to do this is to use binary decision diagrams [Najm91]. As the adder architectures
do not contain any reconvergent fanouts, the probability computation is performed
without any approximation.
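For gates whose inputs are statistically independent, the propagation of signal probabilities reduces to a few closed-form rules. The sketch below illustrates this zero-delay case only; it is not the ASYL+ algorithm, it does not use BDDs, and it assumes temporal independence, so that a node with signal probability p switches with probability 2p(1-p). Function names are illustrative.

```python
def p_and(pa, pb):
    """Probability that a AND b is 1 for independent inputs."""
    return pa * pb

def p_or(pa, pb):
    """Probability that a OR b is 1 for independent inputs."""
    return pa + pb - pa * pb

def p_xor(pa, pb):
    """Probability that a XOR b is 1 for independent inputs."""
    return pa * (1 - pb) + pb * (1 - pa)

def activity(p):
    """Zero-delay switching probability of a node with signal probability p."""
    return 2 * p * (1 - p)

# Propagate and generate signals of one adder bit with equiprobable inputs.
pa = pb = 0.5
propagate, generate = p_xor(pa, pb), p_and(pa, pb)
print("p_i:", propagate, "activity:", activity(propagate))
print("g_i:", generate, "activity:", activity(generate))
```

These rules stop being exact as soon as two inputs of a gate depend on a common signal, which is the situation that BDD-based propagation handles.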

A glitch is created at the output of a gate because of the difference in arrival 
times at its inputs. Then, the glitch can be propagated to the fanout gates according 
to their sensitivity. The probability of a switch due to a glitch cannot be estimated 
the same way as a useful switch for the simple reason that the probability of a node 
to undergo a transition does not depend only on the Boolean network functionality 
but also on its structure (path lengths for instance). 

The formula Pj = 2Pi(1-Pi) is no longer valid for all the possible
transitions. To solve this problem, the probability calculation must be based on
real delay models. Since the switching probabilities are supposed to be known at
the primary inputs of the circuits, they are propagated to the fanout gates up to the
roots of the circuit using the gate delays. Each gate modifies the switching
probability according to its ability to propagate the transition from its inputs to its
output, which we call its sensitivity. The sensitivity calculation rests on the
functionality of the gate according to the probabilities at the inputs. Finally, gates
have a set of switching probabilities, distant from each other according to the glitch
threshold previously introduced, which are added. A simple example of a carry ripple
adder is illustrated in figure 7.




Figure 8: HSPICE/BDD power dissipation vs. number of bits

A switch is represented by a square, coloured according to its occurrence time. Its
probability is a function of the input switching probabilities $P_j$ and the
sensitivities $S_j$ of the cells already encountered. The authors propose a gate level
estimator based on these remarks in [Lau96]. Its dissipation estimates are close
to those of an exhaustive simulation for classical combinational circuits.

The automatic simulation is consistent with the analytical model previously
presented, which rests on unit delay and capacitance, and with a power dissipation
function linear with respect to the fanout capacitances. However, technology
mapped adders have realistic delays as well as a more complex dissipation model at
each switching, including for instance the charging of internal capacitances. As a
consequence, we built an A-cell at transistor level and simulated it with HSPICE.
The submicron technology used is ATMEL ES2 ECPD07. Once the elementary
cell is fully characterised, the synthesis tool estimates the total power dissipation
of classical adder architectures. These estimates are presented in figure 8.



8. CONCLUSION 

In this paper the most frequent adder architectures were compared from the activity, 
delay and cost points of view. The originality of the approach is that the estimation 
of the activity was achieved analytically by implicitly exhaustive enumeration of 
all vectors. This is possible thanks to the properties of the A operator. Redundant 
transitions were taken into account, and a redundant to total power ratio was 
derived. Finally, glitch filtering is taken into account by feeding the BDD tool with 
technology driven HSPICE simulation results. 

In this article it is assumed that all the inputs are ready at the same time, and that
all the outputs are desired at the same time. This is not always the case, especially
if an adder is associated with other operators that have their own delays. For
example in multipliers or dividers, the input arrival times are accessible to
simulation, thus different adder architectures adapted to these conditions should be
examined in order to find the best power-delay-cost trade-off.

Abstract

A method to reduce power dissipation by automatically synthesizing gated-clocks in 
synchronous static CMOS circuits is presented. This synthesis is performed on the gate 
level description of the circuit. The boolean behavior of the inputs of the flip-flops is 
determined by examining the network. This behavior is represented in ROBDD’s. 
Analysis of these equations results in the condition for which flip-flops do not need 
to be clocked. Flip-flops are grouped in so called hold domains, and clocked by a 
gated-clock signal. Power reductions of up to 29% are found. There is only a small 
area overhead (less than 8%). Testability of the resulting design is taken care of. 

Keywords 

Integrated digital circuit design, low-power design

1 INTRODUCTION

Due to the continuously decreasing feature sizes and the increasing clock frequencies
of integrated digital circuits, power dissipation is growing to be one of the major concerns
during the design of an integrated circuit. Examples of this phenomenon are for
instance the DEC Alpha chip (dissipating 30 Watts at 3.3V, 200 MHz) and the SUN
Viking (dissipating 8 Watts at 5V, 50 MHz).

Currently many circuits are designed by describing them in a behavioral description
language like VHDL or Verilog. By using a synthesizer, this description is synthesized
into a gate level netlist. This way of designing saves a lot of design time compared to
traditional design methodologies like schematics entry. Most synthesizers are currently
targeted towards fully synchronous designs, suitable for scan chain insertion, to be
able to test the circuit by scan testing.

One of the main contributors to power dissipation is the clock tree. The clock net is 
one of the nets with the highest switching density. The clock net is also a net with a 
large fanout (all flip-flops are connected to the clock net), resulting in high power 
dissipation. Clocking causes power dissipation in two places: 

• Dissipation in the clock drivers and the clock lines. 

• Dissipation in the flip-flops (most flip-flops contain an inverter connected to 
the clock signal.) 

In microprocessor-like designs there is a large number of registers that merely hold 
their data during most of the clock cycles. Analysis of the circuits generated by logic 
synthesizers reveals that this functionality is implemented by providing a conditional 
feedback loop from the output of the flip-flop to its input. If such a loop is active, 
the flip-flop need not be clocked, because its value will not change; the flip-flop is 
in the so-called "hold mode". Because of the implementation with a feedback loop, 
however, power is consumed unnecessarily. 

A promising technique to reduce the power dissipation of the clock net is to 
selectively stop the clock in parts of the circuit, called "clock gating". This technique 
is not new and has already been applied in a number of ways. In (Schutz 1994) and 
(Suessmith et al. 1994) it is applied during the design of microprocessors, but the 
places where the gated clocks are inserted are determined by the designer. In 
(Benini et al. 1995) an automatic method to insert gated clocks in finite state machines 
is presented. Although every sequential circuit can be modeled as a finite state 
machine, this technique only works if the symbolic transition table of the implemented 
finite state machine is known. For large circuits this is impractical. In 
(Benini et al. 1997) the problem of generating the state transition table is 
circumvented, but the circuit is still modeled as a single finite state machine, so 
clock gating can only be applied if the whole FSM (or design) performs a so-called 
"self loop". In practice, however, only some parts of a design can be switched off by 
clock gating. The tool presented in (Benini et al. 1997) will not be able to find these 
situations. In (Papachristou et al. 1995) a power-saving technique is shown that 
determines during architectural synthesis which flip-flops can be switched off during 
the operation of the circuit. 

In this paper we present a method to generate gated-clock circuits starting from a 
netlist produced by, for instance, a logic synthesizer. The idea of the developed method 
is to identify the flip-flops in the design that keep their data for a large portion of 
the clock cycles. For these flip-flops the condition under which they keep their data 
is determined, and the circuit is transformed in such a way that the clock signal is 
switched off whenever the condition is satisfied. In section 2 definitions are 
presented. Section 3 describes how the transformation is determined. Section 4 gives 
some implementation details. Results and conclusions are presented in sections 5 and 6. 



2 DEFINITIONS 

A mapped network N of logic gates will be represented by a logic network graph 
$G_n = (V, E)$. $V$ represents the primary input and output terminals and the local 
functions (i.e. the gates). The set of directed edges $E$ represents the decomposition 
of the multi-terminal nets $n \in N$ into two-terminal nets, directed from the output 
pin of a gate or a primary input to an input pin of a gate or a primary output. 

We consider the behavior of a gate (and thus the behavior of the corresponding vertex 
$v \in V$) as a completely specified function $f_v : B^n \to \{0, 1\}$, where $B^n$ is 
spanned by the set of all primary inputs and all the flip-flop outputs. The behavior 
$f_n$ of a net $n$ is defined as the behavior of the source vertex $s$ of the net. The 
co-factor of a function $f(x_1, x_2, \ldots, x_i, \ldots, x_n)$ with respect to a 
variable $x_i$ is $f_{x_i} = f(x_1, x_2, \ldots, 1, \ldots, x_n)$. The co-factor with 
respect to $x_i'$ is $f_{x_i'} = f(x_1, x_2, \ldots, 0, \ldots, x_n)$. The consensus of 
$f(x_1, x_2, \ldots, x_i, \ldots, x_n)$ with respect to a variable $x_i$ is 
$CON(f(x), x_i) = f_{x_i} \cdot f_{x_i'}$ (De Micheli 1994). The consensus of a function 
with respect to a variable represents the component that is independent of that 
variable. The consensus operator can be extended to sets of variables as an iterative 
application of the consensus operator on the variables of the set. The equivalence of 
two functions $f$ and $g$ is 

$(f \equiv g) = NOT(f \oplus g)$ (1) 
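
The definitions above can be tried out on small functions. The following is a minimal sketch, not the ROBDD-based implementation used by the tool: it assumes Boolean functions are plain Python callables over a tuple of 0/1 values, and the helper names (`cofactor`, `consensus`, `equivalence`, `truth_table`) are hypothetical.

```python
from itertools import product


def cofactor(f, i, value):
    """Return f with variable x_i fixed to value (1 for x_i, 0 for x_i')."""
    return lambda x: f(x[:i] + (value,) + x[i + 1:])


def consensus(f, i):
    """CON(f, x_i): product of both co-factors, independent of x_i."""
    f1, f0 = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda x: f1(x) & f0(x)


def equivalence(f, g):
    """(f == g) as a Boolean function: NOT(f XOR g)."""
    return lambda x: 1 - (f(x) ^ g(x))


def truth_table(f, n):
    """Evaluate f on all 2^n input combinations, in counting order."""
    return [f(x) for x in product((0, 1), repeat=n)]


# f(a, b, c) = a OR (b AND c); its consensus with respect to b is a.
f = lambda x: x[0] | (x[1] & x[2])
print(truth_table(consensus(f, 1), 3))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

For this example the co-factors are $f_b = a + c$ and $f_{b'} = a$, so their product reduces to $a$, which is the part of $f$ that does not depend on $b$.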

3 THE CIRCUIT TRANSFORMATION 

As described in the introduction, the basic idea of the transformation is to switch off 
the clock of flip-flops that reload their own data. This is only possible if there 
exists a path in the network graph from the output of a flip-flop to its own input. If 
the flip-flop has to keep its value, the output value is fed back to the input of the 
flip-flop. In the transformed circuit the clock of a flip-flop that has to keep its data 
will be switched off. This results in the circuit transformation shown in Figure 1. 




Figure 1. The circuit transformation. 

Two new signals have to be generated: 

1 The HoldExpression signal. This signal determines when the system 
clock will be fed to the flip-flop. 

2 The NonHoldExpression signal, being the new value of the flip-flop 
if the flip-flop is not in hold mode. 

To avoid glitches on the gated-clock signal caused by possible glitches on the signal 
HoldExpression, a latch that is transparent while the signal clk is low has to be 
introduced. 
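
The effect of the latch can be illustrated with a small behavioral model. This is only a sketch under assumptions that are not stated in the text: signals are sampled at discrete time steps, the `enable` input stands for NOT(HoldExpression), the gated clock is formed by an AND gate, and both function names are hypothetical.

```python
def gated_clock_naive(clk, enable):
    """AND the clock directly with enable: enable changes reach the clock."""
    return [c & e for c, e in zip(clk, enable)]


def gated_clock_latched(clk, enable):
    """Pass enable through a latch that is transparent while clk is low."""
    latched = 0
    out = []
    for c, e in zip(clk, enable):
        if c == 0:
            latched = e  # the latch follows enable only during the low phase
        out.append(c & latched)
    return out


clk    = [0, 0, 1, 1, 0, 0, 1, 1]
enable = [1, 1, 1, 0, 0, 0, 0, 1]  # changes twice while clk is high

print(gated_clock_naive(clk, enable))    # [0, 0, 1, 0, 0, 0, 0, 1]
print(gated_clock_latched(clk, enable))  # [0, 0, 1, 1, 0, 0, 0, 0]
```

Without the latch the first clock pulse is cut short and a partial pulse appears at the end; with the latch each clock pulse is either passed completely or suppressed completely.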

3.1 Synthesis of the hold expression 

Assume there exists a flip-flop $ff \in V$ in the logic network graph $G_n = (V, E)$. 
The net connected to the data input of the flip-flop will be denoted $d_{ff} \in E$. 
The data output of the flip-flop will be denoted $q_{ff} \in E$. The expression 
describing the condition for which $d_{ff} = q_{ff}$ will be called the hold expression 
$h_{ff}$ = HoldExpression. To determine the expression 
$f_{d_{ff}}(x_1, x_2, \ldots, x_n)$, where $B^n$ is spanned by the set of all primary 
inputs and all the flip-flop outputs, a traversal of the transitive input cone of node 
$d_{ff}$ in topological order has to be performed (Cormen et al. 1989). The behavior of 
the output signal of a gate is determined from the behavior of its inputs and the 
function of the gate. After the behavior of the input of a flip-flop has been computed, 
the hold expression can be determined by: 

$h_{ff} = (f_{d_{ff}} \equiv q_{ff})$ (2) 

In most cases the hold expression computed by (2) turns out to be rather complex. 
This is because expression (2) not only covers the situation in which the feedback 
loop around the flip-flop is active, but also the situation in which the new data 
that will be clocked into the flip-flop happens to equal its current output. This 
part of the hold expression is called the "data-dependent part" of the hold 
expression. Experiments have shown that the data-dependent part of the hold 
expression has two effects: 

1 The hold expression including the data-dependent part is much more 
complex than the hold expression without the data-dependent part. 

2 The hold expressions for the different bits of a register in an arithmetic 
unit (adder, counter, subtracter etc.) are unequal to each other 
because of the data dependence. This complicates the comparison 
of the hold expressions while composing a hold domain. 

A complex hold expression is undesirable because the hold expression has to be 
implemented in hardware, resulting in extra silicon area and also extra power 
dissipation. 

The control signals are in general the signals that determine when registers are in 
hold mode. The control signals are generated in the controlling finite state machine of 
the circuit or are primary input signals of the circuit. These are the signals that 
enable and disable the feedback path around a flip-flop. If only the part of the hold 
expression that is described by the control signals is implemented, the implemented 
hold expressions are likely to remain simple. If the hold expression is described by 
control signals only, the data-dependent part of the hold expression will not be 
covered. As a first approximation, signals that are members of a bus will be seen as 
data signals. All single signals will be viewed as control signals. The control signals 
are defined as $C \subset E$, a subset of all the signals of the circuit. The data 
signals are defined by $D = E - C$. The hold expression described by only control 
signals can be computed by: 

$hc_{ff} = CON(h_{ff}, D)$ (3) 
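
As a worked illustration (not taken from the paper), consider a flip-flop whose data input is driven by a multiplexer that selects either a bus bit $x$ or the flip-flop output $q_{ff}$. The load-enable signal $e$ is a hypothetical single control signal, while $x$ and $q_{ff}$ are assumed to be bus members and therefore data signals. Applying (2) and (3) gives:

```latex
\begin{align*}
f_{d_{ff}} &= e \cdot x + e' \cdot q_{ff} \\
h_{ff} &= (f_{d_{ff}} \equiv q_{ff}) = e' + (x \equiv q_{ff}) \\
CON(h_{ff}, x) &= (e' + q_{ff}) \cdot (e' + q_{ff}') = e' \\
hc_{ff} &= CON(h_{ff}, \{x, q_{ff}\}) = e'
\end{align*}
```

The term $(x \equiv q_{ff})$ is the data-dependent part; the consensus over the data signals removes it and leaves only the condition that the load enable is inactive.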

3.2 Determination of the hold domains 

Once the hold expressions for all the individual flip-flops are determined, flip-flops 
are grouped into so-called hold domains. A hold domain is a group of flip-flops 
whose members are connected to the same gated-clock signal. The condition under 
which all the flip-flops of a hold domain $D$ are in hold mode is described by the hold 
domain expression $H_D$. $H_D$ can be computed by 

$H_D = \prod_{ff \in D} hc_{ff}$ (4) 

The hold domains have to be constructed such that the power reduction is as large 
as possible. The power reduction depends on: 

1 The number of flip-flops in the hold domain, defined as $\|D\|$. 

2 The relative number of clock cycles that the hold domain is in hold 
mode, defined as $|H_D|$. 

Determination of $|H_D|$ is rather complex. In the general case $|H_D|$ depends on the 
probabilities of all the signals in the support set of $H_D$ and the correlation 
between those signals. As an approximation the signals in the support of $H_D$ will be 
assumed to be uncorrelated, and the probability of each of these signals being "1" 
will be assumed to be 0.5. With these assumptions $|H_D|$ equals the fraction of the 
Boolean space $B^n$ where $H_D = 1$. This can be determined easily (Janssen). 

To obtain a power reduction as large as possible, $\|D\| \times |H_D|$ will be 
maximized. This is done by the algorithm shown below. 

V := {hc_1, hc_2, ..., hc_n} 
while (V ≠ ∅) { 
    D := {hc_m} ∧ hc_m ∈ V ∧ ∀x ∈ V : |x| ≤ |hc_m| 
    H := hc_m; V := V \ D; changed := true 
    while (changed) { 
        Test := hc_t ∧ hc_t ∈ V ∧ 
                ∀x ∈ V : |x ∧ H| ≤ |hc_t ∧ H|; 
        if (||D|| × |H| < ||D ∪ {Test}|| × |Test ∧ H|) { 
            D := D ∪ {Test} 
            V := V \ {Test} 
            H := H ∧ Test 
        } 
        else changed := false 
    } 
    if (||D|| > Threshold) 
        implement hold domain D 
    else 
        V := V ∪ D \ {hc_m} 
} 
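
The grouping algorithm can be sketched in executable form as follows. This is an illustration under stated assumptions rather than the tool's implementation: each control-only hold expression is represented by its on-set (a frozenset of minterm indices) instead of an ROBDD, $|H|$ is the on-set size divided by $2^n$ in line with the 0.5-probability assumption, ties are broken by flip-flop name, and the function name `group_hold_domains` is hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def group_hold_domains(
    hold: Dict[str, FrozenSet[int]], n_vars: int, threshold: int
) -> List[Tuple[List[str], FrozenSet[int]]]:
    """Greedily group flip-flops into hold domains.

    hold maps a flip-flop name to the on-set of its control-only hold
    expression over n_vars control variables.
    """
    space = 2 ** n_vars

    def prob(on_set: FrozenSet[int]) -> float:
        # Fraction of the Boolean space where the function is 1.
        return len(on_set) / space

    remaining = dict(hold)
    domains = []
    while remaining:
        # Seed the domain with the expression that is true most often.
        seed = max(sorted(remaining), key=lambda ff: prob(remaining[ff]))
        domain, h = [seed], remaining.pop(seed)
        changed = True
        while changed and remaining:
            # Candidate that keeps the joint hold condition largest.
            test = max(sorted(remaining), key=lambda ff: prob(remaining[ff] & h))
            joint = remaining[test] & h
            if len(domain) * prob(h) < (len(domain) + 1) * prob(joint):
                domain.append(test)
                del remaining[test]
                h = joint
            else:
                changed = False
        if len(domain) > threshold:
            domains.append((domain, h))
        else:
            # Domain too small: drop the seed, return the others to the pool.
            for ff in domain[1:]:
                remaining[ff] = hold[ff]
    return domains


hold = {
    "r0": frozenset(range(6)),
    "r1": frozenset(range(6)),
    "r2": frozenset(range(6)),
    "s0": frozenset({6, 7}),
}
domains = group_hold_domains(hold, n_vars=3, threshold=1)
```

In this example the three register bits share one hold expression and end up in a single domain, while `s0` never holds at the same time as the register and is left ungated because a domain of one flip-flop does not exceed the threshold.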

4 IMPLEMENTATION OF THE HOLD DOMAIN EXPRESSIONS 

The simplest way to implement the hold domain expressions is to use a logic 
synthesizer and technology mapper. If they are expressed in terms of primary inputs and 
flip-flop outputs, however, the expressions can be rather complex, so direct 
implementation will result in extra area and dissipation. 

4.1 Optimization of the hold domain expression 

In the existing circuit there are many intermediate signals that can be used to 
optimize the hold domain expressions. In many circuits the hold domain expression is 
already implemented and used to control a multiplexer that constitutes the temporal 
feedback loop as shown in Figure 1. To optimize the hold domain expression, all the 
signals in the input cone of the flip-flop data inputs in the hold domain are tried as 
candidates to simplify it. The algorithm checks which local net simplifies $H_D$ as 
much as possible. This net is used to simplify $H_D$, and a new net is searched for. 

Simplification is based on strong division. The result of a strong division of a 
Boolean function $f$ by a Boolean function $g$ is: 

$f = g \cdot a + r$ (5) 

The quotient $a$ and the remainder $r$ can be calculated as: 

$a = f \cdot g + X \cdot g' = ite(g, f, X)$ (6) 

$r = f \cdot g' + X \cdot f \cdot g = ite(g, X \cdot f, f)$ (7) 

with $X$ representing the don't care constant. $ite(a, b, c)$ is the "if-then-else" 
operator defined as (Janssen) 

$ite(a, b, c) = a \cdot b + a' \cdot c$ (8) 
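
Equations (5) to (7) can be checked exhaustively on small functions. The sketch below makes two assumptions for illustration: functions of two variables are stored as 4-bit truth tables in Python integers, and the don't care constant $X$ is resolved to an arbitrary concrete function `x`. The names `strong_divide` and `check_all` are hypothetical.

```python
MASK = 0b1111  # truth tables of 2-variable functions stored as 4-bit integers


def strong_divide(f, g, x):
    """Quotient and remainder of f divided by g, with X resolved to x."""
    not_g = ~g & MASK
    a = (f & g) | (x & not_g)      # a = ite(g, f, X)
    r = (f & not_g) | (x & f & g)  # r = ite(g, X.f, f)
    return a, r


def check_all():
    """Verify f = g.a + r for every f, g and every resolution of X."""
    for f in range(16):
        for g in range(16):
            for x in range(16):
                a, r = strong_divide(f, g, x)
                if ((g & a) | r) != f:
                    return False
    return True


print(check_all())  # True
```

Because the identity holds for every choice of `x`, the don't care can be resolved freely, which is what gives the optimizer room to pick a small quotient and remainder.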



The algorithm described above does not always give satisfactory results; sometimes 
the hold domain expression is still too large. In these cases the circuit shown in 
Figure 2 is generated. This circuit determines when the input signals of the flip-flops 
equal the outputs of the flip-flops, in which case the clock can be switched off. 

Figure 2. Direct implementation of the HoldDomainExpression 

4.2 The NonHoldExpression 

If a flip-flop is not in hold mode, adequate data should be provided to the input of 
the flip-flop. Two approaches can be followed here: 

1 The input signal $d_{ff}$ of the flip-flop $ff$ can be re-synthesized. As the 
don't care set for this optimization the hold domain expression of 
the hold domain $D$ to which the flip-flop belongs can be used. Also 
local nets can be used to simplify the resulting expression. 

2 The data input of the flip-flop can be kept as it was in the original 
circuit (including the feedback path from the output of the flip-flop 
to its input). 

In practice a mixture of these two approaches can be used: if method 1 yields a large 
expression, method 2 can be used. 

4.3 Testability 

Currently most sequential designs are tested using the scan test method. This 
method assumes fully synchronous circuits, as do current test-vector generators. The 
introduction of gated clocks violates this assumption. As shown in (Favalli et al. 
1996), by applying a network transformation (Figure 3) and adding some extra test 
control signals to the clock generation circuitry, as shown in Figure 4, it is possible 
to generate test vectors and to test the gated-clock circuit. 

5 RESULTS 

The developed tool has been tested on two designs: an 8-bit microcontroller 
called CON and a 16-bit general-purpose signal processor called DSP. The designs 
the tool is applied to are produced by a VHDL synthesizer. To keep the computation 
time in the order of seconds, the described algorithm is applied to the hierarchical 
netlist. In this way the ROBDDs do not explode. Because the tool is applied to the 
non-flattened netlist, hold domains that extend across hierarchy boundaries are not 
detected. In practice this appears not to influence the results. The results are shown 
in Table 1. 

Part Nine CAD Techniques for Low-power Design 

Figure 3. Re-modelling of the gated clock circuit for test generation 

Figure 4. Addition of test control signals (signals shown: not(HoldDomainExpression), TC, clk, NewClock) 



Table 1. Results of clock gating 

                                      CON                DSP 
#D-ff                                 732                2183 
#D-ff in loop                         528                2015 
#D-ff in hold domain                  413                1548 
#hold domains                         55                 89 
#local net impl                       41                 88 
#xor network impl                     14                 1 
av. #D-ff/hold domain                 7.5                17.3 
size org circuit (in equiv gates)     10812.6            56781.0 
size new circuit (in equiv gates)     11650.2 (+7.7%)    57214.3 (+0.7%) 
cpu time (in sec)                     189                406 


Power estimation of the circuits is done with an accurate gate-level power estimator. 
This estimator works together with a logic simulator and counts the number of signal 
transitions during simulation. Each transition of a net results in a contribution to the 
power dissipation of the circuit. The tool takes net-specific loading, slopes of the 
signals and the type of the driving cell into account. Table 2 shows the dissipation 
results for design CON. 

Table 2 Power reduction for design CON 

test#   original   gated-clock    gated-clock with          useful gated-clock with 
                                  clock buffer reduction    clock buffer reduction 
        mW         mW     rel     mW     rel                mW     rel 
1       14.5       11.7   0.81    11.4   0.78               11.1   0.76 



As can be seen, a reduction of 19% is obtained by application of the clock gating 
tool. Because the loading on the primary clock is reduced considerably 
(from 732 to 2 * 55 + 732 - 413 = 429), the clock buffer can be made smaller. This 
yields an extra reduction of 3%. The power analysis tool also gives information about 
the number of clock cycles a hold domain is in hold mode. If the number of transitions 
of a gated-clock signal is not significantly lower than the number of transitions of the 
original clock, the domain will not contribute to power reduction, so this domain is 
canceled. This again leads to a reduction of 2%. 
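
The clock load figure quoted above can be reproduced from Table 1. The short sketch below uses the CON numbers; reading the factor 2 as two loads on the primary clock per hold domain (for instance the latch and the AND gate of the gating circuit) is an interpretation, not something the text states explicitly.

```python
flip_flops = 732        # #D-ff for design CON (Table 1)
gated_flip_flops = 413  # #D-ff in hold domain
hold_domains = 55       # #hold domains
loads_per_domain = 2    # assumption: latch plus AND gate of each gating circuit

# Ungated flip-flops stay on the primary clock; every hold domain adds its
# gating circuitry as a new load.
primary_clock_loads = flip_flops - gated_flip_flops + loads_per_domain * hold_domains
print(primary_clock_loads)  # 429
```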

Table 3 gives the power dissipation data for the design DSP. As can be seen, the power 
dissipation and the reduction depend on the input data used for simulation. An average 
power reduction of 27% can be obtained. (The clock buffer of this circuit was not 
included in the design.) 

Table 3 Power reduction for design DSP 

test#   original   gated-clock    useful gated-clock 
        mW         mW     rel     mW     rel 
1       177        145    0.82    134    0.76 
2       163        130    0.80    119    0.73 
3       152        119    0.78    108    0.71 
4       176        146    0.83    134    0.76 



6 DISCUSSION AND FUTURE WORK 

As has been shown, it is possible to obtain a power reduction of up to 29% by applying 
clock gating techniques to microcontroller designs at only a moderate area penalty 
(< 8%). The size of the circuits that can be handled is much larger than has been 
reported so far. Testability of the transformed circuit is maintained during the 
transformation, and the circuit transformations are kept as small as possible. 

6.1 Timing issues 

Application of the described tool can cause timing problems in two places: 

1 The gated clock signal NewClock (Figure 4) is delayed by one gate 
delay (that of the AND gate). This gives rise to clock skew in the 
resulting circuit. 

2 The signal HoldDomainExpression (Figure 4) can possibly violate 
the timing constraints. 

ad 1) The added clock skew is the same for all the gated clock signals. During clock 
tree generation extra buffers can be added to the non-gated clock signal to 
compensate for this skew. In the layout phase the flip-flops belonging to one 
hold domain should be kept close to each other to reduce skew inside a clock 
domain caused by the parasitic capacitances of the wiring. 

ad 2) In practice, the new circuitry added for the new HoldDomainExpression is very 
small. The circuits designed until now did not show problems on this point, 
although a potential problem remains. 

6.2 VHDL transformations 

As discussed, timing problems can occur with the current tool. To circumvent these 
problems, it seems a good idea to perform the clock gating transformations before 
synthesis, i.e. in the VHDL description. In this way the timing constraints given to 
the synthesizer will also be applied to the gated clock circuit. Initial tests have 
shown that this approach is feasible. 

Abstract 

This paper presents a timing-driven floorplanning algorithm for building block 
layout. The proposed algorithm adopts the Elmore delay model as the 
interconnection delay model. The algorithm consists of two phases. In phase 1, a 
timing-driven topological arrangement of blocks is generated, and overlap among 
blocks is resolved, using nonlinear programming under the timing constraints. 
In phase 2, the algorithm performs floorplan sizing, which determines the sizes 
and the shapes of blocks based on the topological arrangement obtained in 
phase 1 so as to minimize the chip area, and obtains a legal floorplan. This 
phase is based on topological constraint manipulation. Experimental results 
show that the proposed algorithm can produce results without any timing 
violations within a practical computation time. 



Keywords 

Floorplanning, Elmore delay, nonlinear programming, topological constraint 



1 INTRODUCTION 

In building block layout, a VLSI circuit is partitioned into a set of components, 
referred to as blocks. There are two types of blocks: hard blocks and soft 
blocks. Hard blocks are pre-designed, and their shapes and pin positions are 
pre-determined. Soft blocks are not pre-designed, so their exact shapes and pin 
positions have to be specified in layout design. The goal of building block 
layout design is to determine the shapes of blocks, the positions of blocks, 
and the interconnections among blocks so that the chip area is utilized in the 
best possible way. 

In building block layout design, floorplanning is the first and an essential 
design step to determine several important factors of the layout such as 
the overall required area of a chip, the sizes and shapes of blocks, pin and 
pad locations, etc. The goal of floorplanning is to realize a placement plan 
that decides topological proximity as well as appropriate shapes of blocks. 
Floorplanning is usually divided into two steps (Lengauer 1990), which are 
generally called global placement and detailed placement. In the global 
placement step, we determine the relative positions of blocks so as to minimize 
the total wire length. Since blocks are generally treated as points (the 
center coordinates of the blocks) when they are placed in the chip, the blocks 
may overlap each other because of their varying shapes and areas. 
The detailed placement step, on the other hand, converts a global placement 
into a legal, area-optimal floorplan* which respects the placement 
produced in the global placement step. 

Due to the progress of semiconductor process technologies, interconnection 
delay can no longer be ignored relative to the switching delay of gates in 
physical design (Bakoglu 1990). Thus we need to consider the interconnection 
delay explicitly during layout design. Like the placement step, the floorplanning 
step has a large effect on the performance of a circuit. Therefore several 
timing-driven algorithms have been proposed (Prasitjutrakul & Kubitz 1989, Ogawa 
et al. 1990, Gao et al. 1992, Tia & Liu 1993, Sait et al. 1994, Youssef et al. 
1995). 

In this paper, we propose a timing-driven floorplanning algorithm for the 
building block layout. The proposed algorithm adopts the Elmore delay model 
as the interconnection delay model. The proposed floorplanning algorithm 
consists of two phases: (1) construction of a timing-driven topological ar- 
rangement with nonlinear programming, (2) conversion of the topological ar- 
rangement into a legal floorplan. In the first phase, we determine a topological 
arrangement of blocks using nonlinear programming considering the timing 
constraints and the resolution of overlap among blocks. Next, we perform 
floorplan sizing which determines the sizes and the shapes of blocks based 
on the topological arrangement obtained in the first phase so as to minimize 
the chip area and obtain a legal floorplan. The second phase is based on the 
topological constraint manipulation. Experimental results, compared with those 
of a non-timing-driven floorplanning algorithm, show that the proposed 
algorithm can produce floorplans without any timing violations in a practical 
computation time. 

The remainder of this paper is organized as follows. In Section 2, we de- 
fine the interconnection delay model and the timing constraint, and formulate 
the floorplanning problem. Section 3 presents a timing-driven floorplanning 
algorithm with nonlinear programming and topological constraint manipula- 
tion. Section 4 shows experimental results and the evaluation of the proposed 
algorithm. Finally, in Section 5, we conclude this paper and describe future 
work. 



*We define a floorplan to be legal if there are no overlaps among the blocks and the shapes 
and placement of the blocks satisfy the aspect ratio bounds and pre-placement conditions. 




A timing-driven floorplanning algorithm 






2 PROBLEM FORMULATION 

A chip is a rectangular region which is surrounded by I/O pins on its boundary 
and contains blocks of different sizes within this boundary. Let $C = (M, N)$ 
be a logic circuit, where $M$ is a set of blocks and $N$ is a set of nets. The set of 
blocks $M = \{m_1, m_2, \ldots, m_M\}$ consists of a set of hard blocks $M_h$ and a set 
of soft blocks $M_s$. Let $w(m_j)$ and $h(m_j)$ be the width and the height of block 
$m_j$, respectively, and let $a(m_j)$ and $r(m_j)$ be the area and the aspect ratio of 
block $m_j$, respectively. For a hard block $mh_i \in M_h$, its width and height are 
fixed and the pin positions are specified in advance. For a soft block $ms_i \in M_s$, 
its width and height are temporarily given but not fixed, and its area and the 
lower and upper bounds of its aspect ratio, $(r_l(ms_i), r_u(ms_i))$, are specified. 
The set of nets $N = \{n_1, n_2, \ldots, n_N\}$ describes the connectivity information. 
For simplicity of presentation, it is assumed that all the pins of a block 
are located at the center of the block. We assume that the interconnections 
are realized by using two layers, the first metal layer (M1) and the second 
metal layer (M2). The M1 layer is mainly used for horizontal wiring and the 
M2 layer mainly for vertical wiring. 

An equivalent circuit of an interconnection is modeled as a distributed RC 
circuit, and Elmore's delay equation (Elmore 1948) is often used to represent 
the interconnection delay (Figure 1). When a multi-terminal net $n_i$ is 
implemented by a Steiner tree, Kuh et al. give an upper bound of the Elmore 
delay from the source pin to a load pin $j$ of net $n_i$ (Kuh & Shih 1992). 

We employ the equation given by Kuh et al. as the interconnection delay 
model. However, constructing Steiner trees during floorplanning is not practical 
in terms of computation time. Therefore the wire length of net $n_i$ is 
estimated by half the perimeter of the bounding box enclosing the pins of 
net $n_i$, and the path length from the source to a load $j$ of net $n_i$ by half 
the perimeter of the bounding box enclosing the source pin and load pin $j$. 
Figure 1 shows an example of the estimation of the wire length. The delay 
from the source pin to load pin $j$ of net $n_i$ is then defined as 

$del_{ij}(w_i, h_i, l1_{ij}, l2_{ij}) = (c_1 w_i + c_2 h_i + \sum_k C_{ik})(R_{i0} + r_1 l1_{ij} + r_2 l2_{ij})$ (1) 

where 

$w_i$, $h_i$ : the width and height of the bounding box of net $n_i$, respectively. 
$l1_{ij}$, $l2_{ij}$ : the width and height of the bounding box enclosing the source 
pin and load pin $j$ of net $n_i$, respectively. 
$c_1$, $c_2$ : the capacitances of M1 and M2 per unit length, respectively. 
$r_1$, $r_2$ : the resistances of M1 and M2 per unit length, respectively. 
$R_{i0}$ : the equivalent output resistance of the source. 
$\sum_k C_{ik}$ : the sum of the load capacitances. 

Figure 1 Interconnection delay model. 

In the following, we abbreviate $del_{ij}(w_i, h_i, l1_{ij}, l2_{ij})$ to $del_{ij}$ for simplicity. 
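
The delay estimate can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: all pins are given as (x, y) block centers, the delay has the product form of equation (1) with bounding-box lengths, and the names `WireParams`, `bounding_box`, `net_delay` as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WireParams:
    c1: float  # capacitance of M1 per unit length
    c2: float  # capacitance of M2 per unit length
    r1: float  # resistance of M1 per unit length
    r2: float  # resistance of M2 per unit length


def bounding_box(points):
    """Width and height of the smallest box enclosing the given points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)


def net_delay(pins, source, load, params, r_out, c_load_sum):
    """Upper-bound delay from source to load using bounding-box estimates."""
    w, h = bounding_box(pins)              # whole net
    l1, l2 = bounding_box([source, load])  # source-to-load path
    capacitance = params.c1 * w + params.c2 * h + c_load_sum
    resistance = r_out + params.r1 * l1 + params.r2 * l2
    return capacitance * resistance


params = WireParams(c1=2.0, c2=1.0, r1=1.0, r2=2.0)
pins = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
delay = net_delay(pins, source=(0.0, 0.0), load=(4.0, 3.0),
                  params=params, r_out=10.0, c_load_sum=5.0)
print(delay)  # 320.0
```

Here the capacitance term is 2·4 + 1·3 + 5 = 16 and the resistance term is 10 + 1·4 + 2·3 = 20, so the estimated delay is 320 in whatever units the parameters use. A timing constraint for this source-load pair is met when the result does not exceed the allowable delay.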

In general, a timing constraint can be specified between a primary input or 
an output of a flip-flop (FF) and a primary output or an input of an FF. However, 
soft blocks are not pre-designed in the hierarchical design of the building block 
layout, and hence the timing constraints inside and outside soft blocks are 
treated separately. Therefore, in this paper, we assume that a timing constraint 
is given between the source pin and a load pin of net $n_i$. We specify the timing 
constraint between the source pin and load pin $j$ of net $n_i$ as the maximum 
allowable propagation delay $Delallow_{ij}$ from the source to the load. 
Consequently, if $del_{ij} \le Delallow_{ij}$, the timing constraint is satisfied. 

Now, we formulate the timing-driven floorplanning problem (TDFP) as 
follows. Given (1) a set of blocks $M = M_h \cup M_s$, (2) a set of nets $N$, (3) 
timing constraints $T$, and (4) physical parameters as input, the problem is 
to determine a legal floorplan of $M$ which minimizes the chip area under the 
conditions: (1) no two blocks overlap, (2) $\forall ms_i \in M_s$, 
$r_l(ms_i) \le r(ms_i) \le r_u(ms_i)$, (3) $\forall ms_i \in M_s$, 
$a(ms_i) = w(ms_i) \times h(ms_i)$, and (4) the timing constraints $T$ are satisfied. 

3 A TIMING-DRIVEN FLOORPLANNING ALGORITHM 

The proposed algorithm consists of two phases: (1) construction of a timing- 
driven topological arrangement with nonlinear programming, (2) conversion 
of the topological arrangement into a legal floorplan. In the following, we 
present the details of each phase. 

3.1 Timing-Driven Topological Arrangement with 
Nonlinear Programming (Phase 1) 

The objective of phase 1 is to obtain a topological arrangement of blocks 
so as to minimize the total wire length under the given timing constraints. 
To achieve this objective, we transform the topological arrangement problem 
into a mathematical programming problem. Since mathematical programming 
tends to require much computation time and memory space, we apply it to 
subcircuits, for which the formulated problem can be solved in a practical 
computation time and with a practical amount of memory. 

In order to obtain an initial topological arrangement, we apply the 
timing-driven placement algorithm for standard cell layout that we presented in 
(Koide et al. 1995). However, the widths and heights of blocks are much larger 
than those of standard cells, so some blocks may overlap each other. 
To minimize overlaps between blocks, we select a target subcircuit and add 
a set of constraints that specify a minimum separation of each pair of blocks. 
Then we formulate the problem into a nonlinear programming problem with 
the timing and minimum separation constraints, and obtain a new topological 
arrangement by solving it. Those operations are repeated until the topological 
arrangement satisfies the timing constraints and the ratio of overlaps among 
blocks becomes less than a pre-determined value. In the following, first, we 
explain selection of a target subcircuit and then formulate the topological 
arrangement problem as a nonlinear programming problem. 

(a) Selection of a Target Subcircuit 

Now, we define a target subcircuit as $C_{mov} = (M_{mov}, N_{mov})$, where $M_{mov}$ 
is a set of blocks, called movable blocks, of the subcircuit, and $N_{mov}$ is a set 
of nets, called movable nets, each connecting to at least one movable block. 
Blocks other than movable blocks in the whole circuit are called fixed blocks and 
their set is represented by $M_{fix}$ ($= M - M_{mov}$). Nets other than movable 
nets in the whole circuit are called fixed nets and their set is represented by 
$N_{fix}$ ($= N - N_{mov}$). 

First, we find one of the nets with a large violation ratio. The violation ratio 
is the actual delay of a net divided by its allowable delay, i.e., 
$del_{ij}/Delallow_{ij}$. The candidate nets are selected from the 10 to 20 percent 
of all nets with the largest violation ratios. To improve the topological 
arrangement in a small number of iterations, however, we do not select 
any net that has been selected in the last $k$ ($> 0$) iterations. Next, the 
set of blocks connected to the selected net is taken as the initial subcircuit. 
The subcircuit is then expanded by adding blocks one by one. An added 
block should have large connectivity, i.e., a large number of connections to 
the present subcircuit. To avoid repeatedly selecting the same block to be 
added to the target subcircuit at each iteration, we introduce randomness 
into the selection step when determining whether a block is included. If it is 
included, all the nets connecting to it are redefined as movable nets. 
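
The net selection step can be sketched as follows. The text does not say how a single net is chosen among the candidates, so the random choice below is an assumption, as are the function name `pick_violating_net`, the `top_fraction` parameter and the dictionary-based inputs.

```python
import random


def pick_violating_net(delays, allowed, recent, top_fraction=0.2, rng=random):
    """Pick a net with a large violation ratio that was not selected recently.

    delays and allowed map a net name to its actual and allowable delay;
    recent holds the nets selected in the last k iterations.
    """
    ratios = {net: delays[net] / allowed[net] for net in delays}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    top_count = max(1, int(len(ranked) * top_fraction))
    candidates = [net for net in ranked[:top_count] if net not in recent]
    # Assumption: any remaining candidate is acceptable, so choose at random.
    return rng.choice(candidates) if candidates else None


delays = {f"n{i}": float(i + 1) for i in range(10)}
allowed = {net: 5.0 for net in delays}
print(pick_violating_net(delays, allowed, recent={"n9"}))  # n8
```

With ten nets and a fraction of 0.2, the candidates are the two worst nets `n9` and `n8`; since `n9` was selected recently, only `n8` remains.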

To solve the problem in a practical computation time, we must limit the 
number of variables of the mathematical programming. Hence, the growing 
process of the subcircuit continues until the number of variables of the
mathematical programming problem reaches a given constant.
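
The selection procedure described above can be sketched as follows. This is only
an illustrative sketch under stated assumptions, not the authors' implementation:
the function names, the acceptance probability `accept_prob`, and the use of a
block-count limit `max_blocks` as a stand-in for the limit on the number of NLP
variables are all hypothetical.

```python
import random


def violation_ratios(delay, allowed):
    """Return {net: actual delay / allowable delay}."""
    return {net: delay[net] / allowed[net] for net in delay}


def candidate_nets(ratios, recent, top_fraction=0.2):
    """Nets in the worst `top_fraction` by violation ratio, excluding
    nets selected in the last k iterations (`recent`)."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return [net for net in ranked[:cutoff] if net not in recent]


def grow_subcircuit(seed_net, net_blocks, max_blocks, accept_prob=0.8, rng=None):
    """Grow the movable block set from the blocks on `seed_net`.

    net_blocks: {net: list of blocks on that net}
    max_blocks: hypothetical stand-in for the bound on NLP variables.
    """
    rng = rng or random.Random(0)
    movable = set(net_blocks[seed_net])
    rejected = set()
    while len(movable) < max_blocks:
        # Connectivity = number of nets linking a block to the subcircuit.
        conn = {}
        for blocks in net_blocks.values():
            if movable & set(blocks):
                for b in set(blocks) - movable - rejected:
                    conn[b] = conn.get(b, 0) + 1
        if not conn:
            break
        best = max(sorted(conn), key=conn.get)
        # Randomised acceptance avoids picking the same block every time.
        if rng.random() < accept_prob:
            movable.add(best)
        else:
            rejected.add(best)
    # Every net touching a movable block becomes a movable net.
    movable_nets = {n for n, blocks in net_blocks.items() if movable & set(blocks)}
    return movable, movable_nets
```

In this sketch a rejected block is simply skipped for the rest of the growth
step; the source does not say how the randomness is applied, so that choice is
an assumption.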

(b) NLP Formulation of the Problem 

Now, we show how to formulate the placement problem as a nonlinear
programming (NLP) problem. We define variables to represent the wire lengths of
movable nets as well as the positions of movable blocks. Let $x_j$ and $y_j$ be the
center X and Y coordinates of movable block $m_j \in M_{mov}$, respectively.
Denote by $w_i$ and $h_i$ the width and height of the bounding box of net
$n_i \in N_{mov}$, respectively. Let $M_i$ be the set of blocks connecting to net
$n_i$. Then, we can represent the bounding box of movable net $n_i$ by the
following inequalities:



$$\mathrm{CN1}: \quad x_j - x_k \le w_i, \quad y_j - y_k \le h_i, \qquad \forall m_j, m_k \in M_i \cap M_{mov},\ \forall n_i \in N_{mov} \tag{2}$$



These inequalities mean that for any pair of movable blocks connecting to $n_i$,
they are completely included in the bounding box of $n_i$. When $n_i$ is connected
to fixed blocks, the following inequalities should be satisfied:



$$\mathrm{CN2}: \quad \begin{aligned}
x_j - Xmin_i &\le w_i, & y_j - Ymin_i &\le h_i \\
Xmax_i - x_j &\le w_i, & Ymax_i - y_j &\le h_i \\
Xmax_i - Xmin_i &\le w_i, & Ymax_i - Ymin_i &\le h_i
\end{aligned}
\qquad \forall m_j \in M_i \cap M_{mov},\ \forall n_i \in N_{mov} \tag{3}$$

where $Xmin_i$ and $Xmax_i$ (resp. $Ymin_i$ and $Ymax_i$) are the minimum and
maximum X (resp. Y) coordinates in fixed blocks of $M_i$. If a net has some fixed
blocks, the bounding box of the fixed blocks can be constructed, and for any
pair of a movable block and the bounding box, their bounding box is
completely included in the bounding box of $n_i$.

To calculate delay $del_{ij}$ from the source to load $j$ of net $n_i$, we introduce
variables $l1_{ij}$, $l2_{ij}$ to represent the horizontal and vertical lengths of the
bounding box of the source and load $j$.

$$\mathrm{CS}: \quad \begin{aligned}
x_k - x_l &\le l1_{ij}, & y_k - y_l &\le l2_{ij} \\
x_l - x_k &\le l1_{ij}, & y_l - y_k &\le l2_{ij}
\end{aligned} \tag{4}$$

where $(x_k, y_k)$ and $(x_l, y_l)$ denote the center coordinates of cells corresponding
to the source and sinks of $n_i$.
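
As a small illustration of what the bounding-box inequalities CN1, CN2 and CS
enforce, the sketch below computes the smallest values of the net width and
height, and of the source-to-load lengths, that satisfy them for given block
centers. The function names are hypothetical and the code is a checking aid
under these assumptions, not part of the NLP solver.

```python
def net_bounding_box(movable_xy, fixed_xy):
    """Smallest (w, h) satisfying CN1 and CN2 for one net.

    movable_xy, fixed_xy: lists of (x, y) block centers on the net.
    CN1 bounds every movable pair; CN2 bounds every movable block against
    the extreme fixed coordinates, so together the tightest feasible box
    is the extent of all centers on the net.
    """
    points = list(movable_xy) + list(fixed_xy)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)


def source_load_lengths(source_xy, load_xy):
    """Smallest (l1, l2) satisfying CS for one source/load pair.

    The two inequalities per axis in CS together bound the absolute
    difference of the coordinates.
    """
    return abs(source_xy[0] - load_xy[0]), abs(source_xy[1] - load_xy[1])
```

For example, a net with movable centers (1, 5) and (4, 2) and fixed centers
(0, 3) and (6, 3) needs a bounding box of at least 6 by 3.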

As mentioned in Section 2, the timing constraints are specified by pairs of
the source pin and a load pin $j$ of net $n_i$. Then the constraints of net
delay can be written as follows.



$$\mathrm{CD}: \quad del_{ij}(w_i, h_i, l1_{ij}, l2_{ij}) \le Delallow_{ij}, \qquad \forall n_i \in N_{mov} \tag{5}$$



To control block overlaps during phase 1, we specify a minimum separation
between two blocks. The minimum separation is expressed in terms of the
distance between the centers of two blocks. The distance between blocks $m_j$
and $m_k$, denoted $dist_{jk}$, is defined as
$dist_{jk} = \sqrt{(x_j - x_k)^2 + (y_j - y_k)^2}$. In
phase 1 of the algorithm, we do not have to remove all the overlaps because
phase 2 does. To avoid over-constraining the problem, we set the minimum
separation between $m_j$ and $m_k$ to



$$sep_{jk} = \eta_{jk} \times \max\left\{ \frac{w(m_j) + w(m_k)}{2},\ \frac{h(m_j) + h(m_k)}{2} \right\} \tag{6}$$



where $w(m_j)$ and $h(m_j)$ are the width and height of $m_j$, respectively, and
$\eta_{jk}$ denotes a parameter ($0 < \eta_{jk} \le 1$). We initially set $\eta_{jk}$ to 0.5
and dynamically increase the value up to 1 during the iterative improvement step.
If we consider the minimum separation constraints for any pair of blocks, we need
$M(M - 1)/2$ constraints where $M$ is the number of blocks. To solve the
mathematical programming problem in a practical computation time, we therefore
select the target subcircuit $C_{mov} = (M_{mov}, N_{mov})$ and apply mathematical
programming to $C_{mov}$. Then the constraints of minimum separation can be
written as follows.

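$$\mathrm{CO}: \quad \begin{aligned}
dist_{jk} &\ge sep_{jk}, & \forall m_j, m_k \in M_{mov},\ m_j \ne m_k \\
dist_{jk} &\ge sep_{jk}, & \forall m_j \in M_{mov},\ \forall m_k \in M_{fix}
\end{aligned} \tag{7}$$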
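
The minimum separation test can be illustrated with a short sketch. Two points
are assumptions rather than facts from the source: the denominators of 2 in the
separation formula (the fraction denominators are not legible in the source),
and the helper names, which are hypothetical. The distance follows the
square-root definition given in the text.

```python
import math


def min_separation(size_j, size_k, eta=0.5):
    """Minimum center-to-center separation of two blocks.

    size_j, size_k: (width, height) of each block.
    eta: relaxation parameter, 0 < eta <= 1 (0.5 initially, raised to 1).
    The division by 2 is an assumption (see the lead-in).
    """
    (wj, hj), (wk, hk) = size_j, size_k
    return eta * max((wj + wk) / 2, (hj + hk) / 2)


def center_distance(pj, pk):
    """Distance between block centers, sqrt(dx^2 + dy^2)."""
    return math.hypot(pj[0] - pk[0], pj[1] - pk[1])


def violates_separation(pj, pk, size_j, size_k, eta=0.5):
    """True if the pair is closer than its minimum separation."""
    return center_distance(pj, pk) < min_separation(size_j, size_k, eta)


def pair_count(num_blocks):
    """Number of separation constraints if every pair is constrained."""
    return num_blocks * (num_blocks - 1) // 2
```

With 33 blocks, constraining every pair would already need 528 separation
constraints, which is why the constraints are restricted to the target
subcircuit.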

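Now, we formulate the topological arrangement problems without/with
the minimum separation constraints as the following nonlinear programming
problems.

[NLP without/with the minimum separation constraints]

$$\text{minimize}: \quad \sum_{\forall n_i \in N_{mov}} \alpha_i (w_i + h_i)$$

$$\text{subject to}: \quad \mathrm{CN1} \cup \mathrm{CN2} \cup \mathrm{CS} \cup \mathrm{CD} \tag{8}$$

$$\text{subject to}: \quad \mathrm{CN1} \cup \mathrm{CN2} \cup \mathrm{CS} \cup \mathrm{CD} \cup \mathrm{CO} \tag{9}$$

where $\alpha_i$ denotes a constant considering the criticality of net $n_i$.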

(c) The Algorithm of Phase 1 

The proposed algorithm iteratively improves a topological arrangement so 
as to minimize the wire length and to satisfy the timing constraints and the
minimum separation constraints. First, the algorithm selects all the blocks 
as the target subcircuit and formulates the NLP without the minimum sep- 
aration constraints (8) for the target subcircuit. Then an initial topological 
arrangement is generated by solving the NLP problem. At each iteration, the 
algorithm performs timing verification and calculates timing violation ratios 
for all the nets. Then, it constructs a target subcircuit, solves the formu- 
lated nonlinear programming (NLP) problem with the separation constraints 
(9), and places blocks in the target subcircuit based on the result of NLP 
problem. These processes are repeated until the maximum violation ratio and 
the overlapping ratio become less than pre-determined values or the itera- 
tion count reaches a preset value. In the current implementation, we use the 
MINOS 5.4 (Murtagh & Saunders 1995) package to solve the NLP problem. 
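
The control flow of phase 1 can be summarized by the following skeleton. It is a
sketch only: the five callables are hypothetical placeholders for the NLP
solves, the timing verification and the subcircuit construction, and the
default thresholds simply mirror the parameter values reported in Section 4.

```python
def phase1(initial_solve, verify_timing, overlap_ratio,
           build_subcircuit, solve_subcircuit,
           max_violation=0.80, max_overlap=0.2, max_iters=300):
    """Iterative improvement loop of phase 1 (illustrative skeleton).

    initial_solve():             NLP (8) on all blocks, no separation constraints
    verify_timing(placement):    returns {net: violation ratio}
    overlap_ratio(placement):    overlapped area / total block area
    build_subcircuit(p, ratios): selects the target subcircuit
    solve_subcircuit(p, target): NLP (9) restricted to the target subcircuit
    """
    placement = initial_solve()
    for _ in range(max_iters):
        ratios = verify_timing(placement)
        if (max(ratios.values()) < max_violation
                and overlap_ratio(placement) < max_overlap):
            break  # both stopping criteria are met
        target = build_subcircuit(placement, ratios)
        placement = solve_subcircuit(placement, target)
    return placement
```

Passing the solver steps in as callables keeps the loop independent of the NLP
package; in the source the NLP problems are solved with MINOS.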



3.2 Floorplanning with Topological Constraint 
Manipulation ( Phase 2 ) 

In phase 2, we determine the block attributes (e.g., absolute locations,
dimensions of soft blocks) so as to minimize the chip area. The output from phase
1 is a topological arrangement optimized for timing and connectivity. In
order to get a final legal floorplan, we must remove all overlaps among blocks.

Figure 2 A topological arrangement of 4 blocks and the corresponding
topological constraint set: (a) phase 1, (b) topological constraints, (c) phase 2.



This phase is performed without undoing any of the decisions of the first 
phase, i.e., topological proximity between blocks in the solution produced by 
the first phase is maintained. We adopt a constraint-based approach we have 
proposed in (Koide et al. 1994) to convert the topological arrangement into a 
legal floorplan. 

The topological arrangement is interpreted as a set of topological
constraints. A topological constraint set of blocks is given by two directed acyclic
graphs $(G_h, G_v)$: $G_h$ is a horizontal constraint graph and $G_v$ is a vertical
constraint graph (Vijayan & Tsay 1991). The node set of both $G_h$ and $G_v$
is exactly the set of blocks $M$. If $(m_i, m_j)$, $m_i, m_j \in M$, is an edge in $G_h$,
then $m_i$ is to be placed to the left of $m_j$. If $(m_i, m_j)$, $m_i, m_j \in M$, is an
edge in $G_v$, then $m_i$ is to be placed below $m_j$. Figure 2 shows an example
of the topological arrangement of blocks and the corresponding topological
constraint set.

In phase 2, we consider the following floorplanning problem FTC
(Floorplanning with Topological Constraint). Given a set of blocks, a strongly
complete* constraint set $(G_h, G_v)$, and bounds on the aspect ratio of each soft
block as input, the problem is to determine a legal floorplan, that is, the absolute
locations of the blocks and the dimensions of the soft blocks, so as to minimize
the chip area under the constraints that the soft blocks satisfy the
given bounds on aspect ratios and the floorplan strongly respects the input
topological constraint set.
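
To make the role of the constraint graphs concrete, the sketch below computes
the chip width and height implied by a constraint set as longest paths in the
two graphs, and checks strong completeness. The helper names and the 4-block
example data are hypothetical (they are not the instance shown in Figure 2),
and the code is an illustration rather than the algorithm VT itself.

```python
from functools import lru_cache


def longest_path_extent(edges, size):
    """Longest path length in a constraint DAG.

    edges: list of (a, b) meaning a precedes b (left-of or below).
    size:  {block: width} for G_h, or {block: height} for G_v.
    """
    succ = {n: [] for n in size}
    for a, b in edges:
        succ[a].append(b)

    @lru_cache(maxsize=None)
    def extent(n):
        # Extent needed from block n to the far side of the chip.
        return size[n] + max((extent(m) for m in succ[n]), default=0)

    return max(extent(n) for n in size)


def strongly_complete(blocks, g_h, g_v):
    """True if every pair of blocks is linked by an edge in G_h or G_v."""
    linked = {frozenset(e) for e in g_h} | {frozenset(e) for e in g_v}
    return all(frozenset((a, b)) in linked
               for i, a in enumerate(blocks) for b in blocks[i + 1:])


# Hypothetical 4-block instance: blocks 1, 2 in a lower row, 4, 3 above.
widths = {1: 2, 2: 3, 3: 4, 4: 1}
heights = {1: 2, 2: 2, 3: 1, 4: 3}
g_h = [(1, 2), (4, 3)]
g_v = [(1, 4), (1, 3), (2, 3), (2, 4)]
chip_width = longest_path_extent(g_h, widths)
chip_height = longest_path_extent(g_v, heights)
```

Deleting a redundant edge on a longest path can only shorten that path, which is
why the area reduction step concentrates on edges of the longest paths.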

The proposed algorithm in phase 2 is an extension of the algorithm
presented in (Vijayan & Tsay 1990, Vijayan & Tsay 1991), which we refer to as
the conventional algorithm, or the algorithm VT, in the following. The
approach of the algorithm VT is to first derive a complete constraint set from
the input relative placement and then to delete redundant constraints
on longest paths in $G_v$ and $G_h$ until no redundant edges exist in $G_v$ or $G_h$.
Next, soft blocks on longest paths in $G_v$ and $G_h$ are reshaped a
constant number of times to further reduce the chip area. The algorithm repeats
both steps until no improvement of the solution is achieved.



*A constraint set $(G_h, G_v)$ is said to be strongly complete if for any pair of blocks $m_i$ and
$m_j$, there is an edge connecting them in either $G_h$ or $G_v$ or both.

First, we introduce some modifications of the algorithm VT, which is used as
a procedure in the proposed algorithm. Unlike the algorithm VT, we reshape
all soft blocks on the longest path in both horizontal and vertical constraint
graphs a constant number of times to reduce the chip area. For soft blocks which
are not on the selected longest path, we also reshape the soft blocks to make
their aspect ratio close to one. In the following, the algorithm VT with the
above modification is called the Modified Vijayan and Tsay algorithm (the
algorithm MVT).

Next, we describe the tentative insertion (Koide et al. 1994) of a constraint.
In the algorithm VT, the order in which constraints are deleted may influence
the final results. To search a wide range of the solution space by changing the
order of deletion of constraints, we may first modify the given constraint
set by tentatively inserting a dummy constraint $s_i$, which is not contained in
the original constraint set $(G_h, G_v)$. This constraint is called a soft constraint.
On the other hand, a non-redundant constraint which is originally contained
in the given constraint set $(G_h, G_v)$ is said to be a hard constraint. This
insertion of a soft constraint $s_i$ is called tentative insertion.
In the proposed algorithm, we tentatively insert a soft constraint, and check
whether one of the longest paths on the constraint graphs $G_h$ and $G_v$ is
changed. If it is, we actually insert the soft constraint, update the constraint
set, and apply the algorithm MVT. To improve the obtained solution, we also
reinsert not one deleted constraint but all deleted constraints $(m_j, m_i)$ relative
to each block $m_i \in M$ in the previous step and apply MVT again to obtain
a final result.

During the algorithm execution, the algorithm keeps the tentatively best 
solution. When updating the tentatively best solution, the algorithm checks 
not only the chip area but also timing violation. Thus, the final solution of 
phase 2 is assured to satisfy the timing constraint. 



4 EXPERIMENTAL RESULTS 

The proposed algorithm has been implemented in C on an UltraSPARC
workstation, and tested on MCNC benchmark data. In the experiments, we used
“ami33” (123 nets, 33 blocks, 42 I/O pins, and 92 timing constraints) and
generated two data sets, “ami33a” and “ami33b”, by changing the positions of I/O
pads in “ami33”. In the experiments, all blocks are regarded as soft blocks. Since
there is no timing information for the original circuits, we estimated the
delay of each net based on a floorplanning result which was obtained by a
non-timing-driven floorplanning algorithm, and gave a timing constraint for
each net by multiplying the estimated delay by 0.9 ~ 1.2.

To show the effectiveness of phase 1 of the proposed floorplanning algorithm,
we compared the results of phase 1 of the proposed timing-driven algorithm
with the proposed algorithm without considering the timing constraints. In
the experiments, we set the permissible violation ratio, the overlapping ratio
of blocks, and the maximum iteration count to 0.80, 0.2, and 300, respectively.
Table 1 shows the results of phase 1. In the table, “without” and “with” mean
the proposed algorithms without and with considering the timing constraints,
respectively. “Length” is the estimated total wire length, that is, the sum of
half the perimeter length of the bounding box for each net. The term “#vio”
is the number of violated timing constraints and “Max vio” is the maximum
violation ratio for the given timing constraints. “Overlap ratio” is the ratio
of the sum of the overlapped area among all the blocks to the total area of
blocks.

Table 1 Experimental results of phase 1 of the floorplanning.

data     Algo      Length   #vio   Max vio   Overlap ratio   Time(sec)
ami33    without   73518    8      1.24      0.19            3145.5
         with      73432    0      0.98      0.17            3136.3
ami33a   without   95648    9      1.21
         with                      0.98      0.12
ami33b   without   94638    8      1.47      0.15            3715.3
         with      96426    0      0.96      0.19            4769.0

Table 2 Experimental results of phase 2 of the floorplanning algorithm.

data     Algo   Area      Length   #vio   Max vio   Time(sec)
ami33    conv   1507695   72379    2      1.30      5.0
         pro    1562697   78032    0      0.87      114.8
ami33a   conv   1873632   100247   7      1.50      3.4
         pro    1706821   94501    0      0.98      156.2
ami33b   conv   1526887   96858    7      1.53      3.6
         pro    1479049   99328    0      0.98      108.0

The results in Table 1 show that, while the algorithm without timing
constraints violates some timing constraints, the proposed algorithm
produces comparable results without any timing violations. As for the
running time, since the proposed algorithm solves an NLP problem which
takes the minimum separation constraints as well as the timing constraints
into consideration, the running time is about one hour.
However, the number of blocks in a building block layout is generally small,
and we therefore consider this running time still practical.

Next, in Table 2, we show the floorplanning results obtained by applying
the detailed placement step (phase 2) to the results of phase 1 in Table 1. For
the results of phase 1 of the proposed timing-driven algorithm, we performed
phase 2 of the proposed algorithm presented in Section 3.2. On the other hand,
for the results of phase 1 of the algorithm without timing constraints, we
applied the algorithm VT (Vijayan & Tsay 1991). In Table 2, “pro” means the
proposed floorplanning algorithm and “conv” denotes the conventional
algorithm which does not consider the timing constraints in the global placement
step and applies the algorithm VT in the detailed placement step. “Area”
denotes the area of the minimum rectangle enclosing all the blocks. The
results show that the proposed floorplanning algorithm can successfully produce
results without any timing violations within a practical computation time.
Since phase 1 of the proposed algorithm can produce global placements
that satisfy the given timing constraints and have little overlapped area, phase
2 of the proposed algorithm can produce floorplans without timing
violations by maintaining topological proximity between blocks in the solution
produced by phase 1.



5 CONCLUSION 

In this paper, we proposed a timing-driven floorplanning algorithm based on 
nonlinear programming and topological constraint manipulation. The pro- 
posed algorithm consists of two phases. The topological arrangement in the 
first phase is iteratively improved by solving the nonlinear programming prob- 
lem for a target subcircuit. The second phase transforms this topological ar- 
rangement into a legal floorplan and determines the positions of blocks and 
shapes of soft blocks so as to minimize the chip area under the given timing 
constraints. The experiments show that the proposed algorithm can
produce floorplans without any timing violations in a practical computation
time. Future research includes a reduction of the computation time of the
algorithm and an extension of the algorithm to optimize the dimensions of
blocks in the first phase.

Abstract

This paper presents a new layout style for macro-cells, using three-metal-layer
CMOS processes with stacked vias. The routing method between leaf cells reduces
the track density by up to 50% (compared to a double-layer router), allowing an
important area reduction. The motivation for this work is the set of requirements
in submicron technologies: smaller area, smaller delay and less power consumption.
To meet these requirements, the main cost function in all algorithms (cell
generation, placement, routing) is the parasitic capacitance reduction. Our
contribution is the new layout style and the associated routing method.

Key-Words 

Automatic Layout Synthesis, Layout Style, CAD, Routing. 



1. INTRODUCTION 

The standard-cell approach is currently considered by VLSI designers the best
solution to synthesize large random logic blocks. Since the cells in a library are
pre-characterized, it is easy for the designer to simulate large blocks without
synthesizing the layout. This leads to accurate delay estimation of the final layout.
Then why use and develop automatic layout synthesis tools? The answer is
performance and power reduction. The main advantage of automatic layout
synthesis is the possibility to individually size each transistor in the circuit
according to delay constraints. In this way, the designer chooses transistor sizes
according to the area-delay-power trade-off.

VLSI: Integrated Systems on Silicon R. Reis & L. Claesen (Eds.) 

©IFIP 1997 Published by Chapman & Hall







Part Ten Physical Design Issues in Sub-micron Technologies 



The basic leaf cell layout style was defined by Uehara (1981), which we call
linear-matrix from now on. The linear-matrix style was the basis for many other
styles in the 80's. In this style, the main cost function is the diffusion gap
minimization, aiming at area and side-wall capacitance reduction. A cell with no
diffusion gaps has minimal area and minimal parasitic capacitance values.
Another example of a popular layout style in the 80's is the gate-matrix (Lopez,
1980). The gate-matrix style has a great number of diffusion gaps and
consequently poor performance due to the associated parasitic capacitances. It is
important to remark that the parasitic capacitance reduction was not so relevant in
"micron" technologies, where the cell delay is normally much larger than the
associated routing delay. However, in submicron technologies, cell and routing
delay are comparable, justifying new cell synthesis and routing methods. Also, in
submicron technologies there are several routing layers and stacked vias, which
increases the need to change the traditional concepts of layout style and routing.

The THEDA system (Hwang, 1993) adopts a layout style with power/ground rails
routed in the first metal layer between, instead of outside, the diffusion rows. The
main advantages of this style are: (i) the possibility to create large transistors, with
smaller impact on the silicon area (the transistor can be extended into the channel
routing); (ii) the polysilicon length is reduced (as it has poor conductivity, it is
important to avoid wires in this layer); (iii) the routing is simplified. At present,
some technologies use silicides over polysilicon (polycides) and contacts,
which increases their conductivity and allows local wires to be created in polysilicon.

Lin (1994 and 1996) considers the transistor sizing problem, using the
THEDA system (Hwang, 1993). This style was used because it makes it easy to create
large transistors. Traditional layout styles require all cells in a row to have the same
height. This constraint leads to either wasted area in small cells or layout failure in
large cells. Because the power/ground rails are interposed among transistors in the
layout style of THEDA, this constraint can be relaxed. This allows transistor
widths to be increased arbitrarily as needed.

Kim (1994) presents a cell model, also with power/ground rails among
transistors, with input/output pins between power/ground rails, in the middle of
the cell. The first metal layer is used at cell level to create internal connections
and metal2/metal3 for connections between cells. The routing is performed over
transistors, resulting in a channelless approach.

Fukui (1995) presents a two-dimensional transistor arrangement, instead
of a linear arrangement. In this style, the main cost function is the optimal
transistor placement (horizontally and vertically), instead of the number of
diffusion gaps. Cell area is comparable to hand-crafted design, but the number of
diffusion gaps is high. The authors take into account only silicon area, and there
is no electrical simulation data in the paper. In Kim (1992) another layout style is
presented, with transistor placement perpendicular to power rails. This style
allows each transistor to be sized individually, keeping the layout area as small as
possible. Its drawback is the routing area: there are vertical wires to connect
cell gates and horizontal wires to connect different cells, resulting in small
transistor densities.

Routing with three metal layers has been used mainly within the over-the-cell
(OTC) approach (Terai, 1994; Kim, 1996). The goal of the OTC is to reduce track
density, employing one metal layer over the cells. In OTC routing, the cell layout
area as well as the channel area between two cell rows are used as routing
resources. Leaf cells must have "transparency" to one metal layer (for example,
metal3), usually with power/ground rails in the middle of the cell in order to
reduce obstacles. Kim (1996) shows that an OTC router achieves area
savings from 21 to 50% in benchmark circuits.

Our objective is to present a new layout style with the following features: (i)
parasitic capacitance reduction; (ii) small silicon area; (iii) routing with 3 metal
layers; (iv) minimum use of polysilicon layer; (v) transistors with arbitrary widths.

The rest of this paper is organized as follows: section 2 analyzes the layout style 
at cell level, section 3 presents the main algorithms to synthesize macro-cells 
(partition and routing), section 4 discusses some preliminary results as well as our 
conclusions and directions for future work. 



2. THE NEW LAYOUT STYLE 

In our approach, at the cell level, transistors are still placed using the linear- 
matrix style, i.e., two horizontal diffusion lines parallel to power/ground rails. In 
this style transistors are connected, as much as possible, by abutment, aiming to 
reduce diffusion capacitances. Figure 1 shows a serial connection between two 
transistors and the associated parasitic capacitance (1.0 µm technology,
capacitances for the diffusion layer: Carea = 0.31 fF/µm² and Cside-wall = 0.45
fF/µm).



Figure 1 Parasitic capacitance of serial connected transistors.
Connection by abutment: Cpar = Carea × 2.25 µm² + Cside-wall × 3 µm = 2.05 fF.
Connection by metal (diffusion gap): perimeter = 2.75 + 2.75 + 1.5 = 7 µm,
Cpar = 2 × (Carea × 4.12 µm² + Cside-wall × 7 µm) = 8.86 fF (+ Cmetal + Ccontact).

For this technology, the input capacitance of a minimum inverter is 4.31 fF,
smaller than the capacitance generated by a diffusion gap. This example shows
why transistors must be connected by abutment and how important it is to reduce
parasitic capacitances in order to improve performance.
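
The arithmetic behind this comparison can be reproduced with a few lines. This
is only a worked check using the diffusion capacitances and the areas and
perimeters quoted for Figure 1; the function name is hypothetical, and the
metal and contact contributions are left out because the source gives no
values for them.

```python
C_AREA = 0.31      # diffusion area capacitance, fF/um^2 (1.0 um technology)
C_SIDEWALL = 0.45  # diffusion side-wall capacitance, fF/um


def diffusion_capacitance(area_um2, perimeter_um):
    """Parasitic capacitance of one diffusion region, in fF."""
    return C_AREA * area_um2 + C_SIDEWALL * perimeter_um


# Two transistors sharing one diffusion region (connection by abutment).
c_abutment = diffusion_capacitance(2.25, 3.0)

# A diffusion gap: two separate regions joined by metal (metal and contact
# capacitances are not included here).
c_gap = 2 * diffusion_capacitance(4.12, 7.0)

INVERTER_INPUT_FF = 4.31  # input capacitance of a minimum inverter, fF
gap_to_inverter = c_gap / INVERTER_INPUT_FF
```

Under these numbers the diffusion gap costs roughly four times the abutment
capacitance and about twice the input capacitance of a minimum inverter.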

The power and ground rails are placed over the transistors, using the second
metal layer (metal2). The connection between drains/sources and the power/ground
rails is done with stacked vias (via1 over contact). The contacts to the substrate
(body-ties) are placed in the routing region, above the PMOS diffusion row and
below the NMOS diffusion row. Figure 2 illustrates the transistor placement,
supply rails, contacts to the substrate and the polysilicon wires. This solution has
the following advantages:

• The silicon area may be smaller, since now there is no exclusive area for 
supply rails (they are placed over the transistors). 

• The length of the polysilicon wires is smaller, since the NMOS and PMOS
transistors are placed close to each other. As the polysilicon has poor
conductivity (RCpoly = 5*RCmetal1), this feature improves the electrical
performance of the cells. In our layout style the polysilicon layer is only used
to connect transistor gates, except for short segments to solve cyclic
constraints in the routing region.

• The transistors can be freely sized. 




Figure 2 Transistors, body-ties, supply rails and polysilicon wires 

The drains/sources connected to a cell output are routed with the first metal
layer (metal1). A small vertical wire (metal1) connects the output nodes to a
horizontal line (also in metal1) placed between N and P transistors. This topology
is illustrated in figure 3.




Figure 3 Output nodes

The number of contacts in each drain/source node is a function of the transistor
width. In this way, wider transistors receive more contacts in each drain/source,
which increases the maximum current through the drain/source nodes. To connect the
drains/sources to the routing regions (upper and lower) we use a vertical stub in
metal1; to connect the gates we use a vertical stub in polysilicon. In this way, we
can see the leaf cells as boxes, with I/O pins at the extremities (figure 4).




Figure 4 Stubs and "interface line" 



This cell model is transparent to the third metal layer, which will be used
vertically to make the feedthroughs. There are two types of feedthroughs:

• when a net crosses a row and does not connect to any cell in this row, a
vertical wire in metal3 is used over the row.

• when a net crosses a row and also connects to a cell in this row, the I/O pins
of the cell which has the net are used (nets "A" and "B" in
figure 4).

The line between transistors and the routing regions is called "interface line". In
this line, stacked contacts will be placed, since non-adjacent layers will be
connected (polysilicon to metal2, metal1 to metal3 and optionally polysilicon to
metal3). This line holds the contacts between nets of the routing regions
and the I/O pins of the cells, feedthroughs and body-ties.

This model, with I/O pins in the extremities, can be inefficient for large
transistors. The reason is the obstacle induced by the contact between the
polysilicon wire and the routing region. A simple solution consists in placing a
vertical wire, in metal1, over the polysilicon wire, with a contact in the middle of
the cell (the horizontal routing of the output nodes must then be in metal3). In this
way, the interface between gates and the channel routing will be done by a via,
with no obstacles to transistor sizing.

This solution was not employed, as we use uniformly sized transistors and buffer
insertion (inverters with 2 or more parallel transistors) after nodes whose
output capacitance exceeds the fanout limit (Turgis, 1995). This approach
minimizes the parasitic capacitances, mainly the side-wall capacitance. Layouts
with very large transistors tend to increase the parasitic capacitances excessively,
increasing delay and power consumption.

To summarize, the layers are used in the following directions:

                   diffusion   polysilicon   metal1   metal2   metal3
cell level         H           V             V/H      H        V
routing channel*   -           -             H        V        H

H-horizontal, V-vertical, * preferential direction before optimization steps.



3. OUTLINE OF THE MACRO-CELL SYNTHESIS TOOL 

To generate a macro-cell, our system executes the following tasks: leaf cell 
extraction, cell generation (transistor pairing), partition and placement, global 
routing and detailed routing. 

The leaf cell extraction isolates all basic cells of the input netlist (in Spice
format, at the transistor or gate level). Basic cells are simple cells, represented by
a dual graph, with n inputs and 1 (one) output. Our system deals with CMOS
static gates and transmission gates. There are two advantages in using cell
generation: (i) the possibility to individually size all transistors
according to the delay constraints, and (ii) the possibility to use complex gates
(and-or-inverter, or simply AOIs). The advantages of AOI gates are smaller area,
delay and power. Comparing a standard-cell library with automatic cell generation
(mapping with AOI gates, limited to 4 serial connected transistors), the average
reduction in the transistor count is 35% (Reis, 1995).

The next step, cell generation, fixes the transistor order of all basic cells. The
main cost functions are: minimization of diffusion gaps and minimization of
intra-cell routing (cell height reduction). The algorithm tries to find the same
Euler path in both graphs (N and P planes) of each cell. If there is at least one
common path between the N and P planes, the cell will be generated with no diffusion
gaps. The cell area is proportional to the number of inputs.
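
The search for a common Euler path can be illustrated by a brute-force sketch.
It is not the algorithm used in the tool (the source does not describe it), and
it only suits cells with a handful of inputs because it enumerates all gate
orderings. The graph encoding and function names are assumptions: each
transistor is an edge labelled by its gate input, joining two diffusion nodes.

```python
from itertools import permutations


def is_trail(edges, order):
    """True if visiting the transistors in `order` walks a single trail.

    edges: {gate: (node_a, node_b)}; order: sequence of gate names.
    `ends` holds the nodes the walk may currently be standing on.
    """
    ends = set(edges[order[0]])
    for gate in order[1:]:
        u, v = edges[gate]
        nxt = set()
        if u in ends:
            nxt.add(v)
        if v in ends:
            nxt.add(u)
        if not nxt:
            return False  # a diffusion gap would be needed here
        ends = nxt
    return True


def common_euler_path(n_edges, p_edges):
    """First gate ordering that is an Euler trail in both planes, or None."""
    for order in permutations(sorted(n_edges)):
        if is_trail(n_edges, order) and is_trail(p_edges, order):
            return order
    return None


# 2-input NAND: series NMOS pull-down, parallel PMOS pull-up.
nand_n = {"A": ("out", "x"), "B": ("x", "gnd")}
nand_p = {"A": ("vdd", "out"), "B": ("vdd", "out")}
nand_order = common_euler_path(nand_n, nand_p)
```

Because every transistor appears exactly once in an ordering, an ordering that
forms a trail covers all edges and is therefore an Euler path of that plane.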

Partition and placement are a single task. We use the quadrature algorithm 
(Fidduccia, 1982), with pin propagation (item 3.1). Global routing is executed 
after cell placement. The main function of the global routing is to determine 
where feedthroughs will be inserted. We can use vertical wires (metal3) over the 
cells or I/O pins of the cells, as explained in the previous section. The global 
routing generates a list of channels, which will be implemented according to the 
method described in section 3.2. 

The result of these tasks is a symbolic description of the macro-cell. Except for
the transistor sizing, this description is fabrication process independent, allowing
the use of any three-metal-layer CMOS process with stacked contacts. This
symbolic description is translated into layout by a compactor tool, e.g., the ones
included in the CADENCE or MENTOR frameworks.

3.1. Partition and Placement

A performance driven algorithm must consider the following cost functions:

• avoid long wires; 

• avoid congestion areas into the channels, typically in the middle; 

• reduce the distance between cells of the critical path(s). 

We use the quadrature placement as the basic algorithm. The quadrature placement
(figure 5) alternately divides the circuit in the horizontal and vertical directions,
minimizing the cutsize in each direction (Fidduccia, 1982). The circuit is initially
partitioned in the vertical direction, using the min-cut algorithm, into two blocks
with the same area. This first partition reduces the routing density in the middle
of the channels, avoiding congestion in these areas. Next, each area is partitioned
into two blocks with the area proportional to the number of rows. For example, for
7 rows there might be an area ratio 4:3, for 4 rows 2:2 and so on. The partition
procedure stops when no more horizontal partitions are possible (area ratio 1:0).
This process results in a set of quadrants, with few cells (typically 2-8) in each
one. The cells in each quadrant can be placed using a simple algorithm. In our
placement, cell order is obtained directly from the connectivity between cells.
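
The row-based area ratios used for the horizontal cuts can be sketched as
follows. The sketch covers only how the ratios are derived from the number of
rows; it does not perform the min-cut partitioning or the alternation with
vertical cuts, and the function names are hypothetical.

```python
def split_rows(rows):
    """Area ratio (upper, lower) for a horizontal cut over `rows` rows."""
    upper = (rows + 1) // 2
    return upper, rows - upper


def horizontal_cuts(rows):
    """All horizontal cut ratios applied until single rows remain.

    A region with one row would give the ratio 1:0, so it is not cut.
    """
    if rows <= 1:
        return []
    upper, lower = split_rows(rows)
    return [(upper, lower)] + horizontal_cuts(upper) + horizontal_cuts(lower)
```

For 7 rows the first cut uses the ratio 4:3, and for 4 rows the ratio 2:2, as in
the examples given in the text.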



Figure 5 Quadrature placement (labels in the figure: 1, 2a, 2b, 3a, 3b, 3c, 3d)

The main problem of the quadrature is the placement of cells with common nets
in non-neighbouring quadrants. For example, in figure 5, consider two cells with a
common net between quadrants 2a and 2b. After vertical partition, these cells
might be placed in quadrants 3b-3d, 3b-3c, 3a-3d or 3a-3c. In order to reduce the
wire length, it is preferable to place these cells in quadrants 3b-3d or 3a-3c.

To improve the efficiency of the quadrature placement, we implemented a pin 
propagation method. Pin propagation tries to place cells with common signals in 
adjacent quadrants, reducing routing length. The main idea is the following: to 
make a partition within a quadrant, each quadrant already partitioned is also 
taken into account. For the horizontal partition, the quadrants are processed line 
by line, from left to right. For the vertical partition, the quadrants are processed 
column by column, from bottom to top. 

This algorithm avoids long wires and congestion by distributing the connections
homogeneously in both directions (vertical/horizontal), which reduces the track density.
The algorithm is fast: for example, it takes 2.22 minutes for a circuit with 3156
gates on a Sun Sparc5.

The next function to include in this placement procedure is the information
related to the critical paths. This information will guarantee that cells in the
critical path are placed closer together.

3.2. Detailed Routing

The track density (lower bound) in each channel is defined by the clique of the 
horizontal graph. If the routing procedure uses only two layers to create wires, the 
track density will be equal to the clique, if there are no vertical cycles. If we have 
3 layers, the channel can be folded in the middle, making it possible to reduce 
routing area up to 50%. As routing area is responsible for 60% of the circuit area 
(average value for random logic blocks), we can expect a reduction of 30% in the 
total circuit area. 
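The reasoning above can be made concrete with a short sketch. It assumes that each net occupies one horizontal interval of columns in the channel, so that the clique of the horizontal graph equals the largest number of intervals overlapping at a column. The function names and the sample intervals are illustrative only.

```python
import math


def channel_density(intervals):
    """Maximum number of horizontal net intervals that overlap at any column.

    Each interval is (left_column, right_column), both inclusive.
    """
    events = []
    for left, right in intervals:
        events.append((left, 1))        # interval opens at `left`
        events.append((right + 1, -1))  # and closes just after `right`
    density = current = 0
    for _, delta in sorted(events):     # at equal columns, closings sort first
        current += delta
        density = max(density, current)
    return density


def folded_tracks(density):
    """Best-case track count when the channel is folded onto two horizontal layers."""
    return math.ceil(density / 2)


def total_area_reduction(routing_fraction, routing_reduction):
    """Share of total circuit area saved when only the routing area shrinks."""
    return routing_fraction * routing_reduction
```

With the figures quoted in the text, `total_area_reduction(0.6, 0.5)` gives the 30% estimate: routing is 60% of the area and folding removes at most half of it.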

The track density in our routing method is equal to clique/2 (best case),
superposing metal1 and metal3 horizontally, with metal2 in the middle. Another
advantage of our method is the large number of suppressed vias, typically 40-55%
(see table 1), which also reduces parasitic elements.

Our routing algorithm is divided into four steps:

• initial double-layer routing;

• metal2-to-metal1 and metal2-to-metal3 track transformation (vertical filter);

• metal1-to-metal2 and metal3-to-metal2 track transformation (horizontal
filter);

• cycles solution.

The first step uses a conventional double-layer greedy router (Rivest, 1982).
When the double-layer router is finished, the odd tracks are superposed on the
even tracks. Figure 6a shows the double-layer solution (in fact, there are 3 layers,
two horizontal and one vertical), and figure 6b the layer superposition. A 3D
illustration of the channel is shown in figure 6c. The channel can be seen as 2
channels, an upper channel (metal3) and a lower channel (metal1), with a
connecting layer between them (metal2).




Figure 6 Routing approach: (a) double-layer routing, (b) three-layer routing, (c) three-layer routing - 3D view

The second step transforms metal2 vertical wires into metal1 or metal3 vertical
wires, suppressing unnecessary vias. The average suppression rate is 35 to 40%
in each channel. The third step transforms metal1/metal3 horizontal wires into
metal2 horizontal wires. The average suppression rate in this step is 5%. In the
channel routing there are no stacked vias. All stacked vias are in the "interface
line" or in the power/ground connections. The "interface line" is responsible for
making the connections between cells and channels.

The fourth step resolves via cycles. They arise when there are adjacent contacts
in the same column, resulting in a via2 superposed on a via1, and consequently a
short-circuit. In the greedy router, there is a procedure to avoid cycles, provided
that the track density is not increased. Cycles can therefore still occur, because
this procedure may fail. These cycles are resolved in the
following manner:

• usually, at least one via can be suppressed (vertical filter);

• if the cycle remains, we use an "underground" solution, changing the vertical
metal2 wire, connected to the horizontal metal1 wire, to polysilicon (a fourth
layer). In this way, we have a contact under a via2, without electrical
connection. Experiments on benchmark circuits show that this case occurs
occasionally, usually up to 4 times in each channel. This vertical polysilicon
wire is short, with no impact on the parasitic capacitance of the cell.

Table 1 illustrates in the third column the number of vias when using a greedy 
router with no optimizations (filters), in the fourth column the number of vias 
after vertical and horizontal optimizations, in the fifth column the via reduction 
rate and finally the total number of "underground" solutions in these circuits. The 
number of vias is in the channel routing, without the vias of the “interface line”. 



Table 1 Number of vias after horizontal/vertical filters (c* are ISCAS benchmarks)

Circuit                        Transistor  Vias      Vias         Reduction   "Underground"
                               number      (greedy)  (optimized)  (fraction)  solutions
adder 2 bits (AOIs)                  28        31          8        0.74          0
adder 2 bits (gates)                 40        34          0        1.00          0
c432                                150       122         18        0.85          0
alu 4 bits (AOIs)                   260       320        138        0.57          5
alu 4 bits (gates)                  424       451        224        0.50          1
ripple carry adder 12 bits          448       624        233        0.63          3
carry lookahead adder 12 bits       528       725        270        0.63          3
booth multiplier 4 bits             824       831        286        0.66          5
c880                               1750      2173       1086        0.50         30
c1355                              2244      2710       1269        0.53         28
c1908                              3156      4126       2124        0.49         53
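
The reduction column of Table 1 is expressed as a fraction rather than a percentage. A minimal check, assuming it is computed as one minus the ratio of optimized to greedy via counts (an assumption that matches the tabulated values), could read:

```python
def via_reduction(greedy, optimized):
    """Fraction of vias suppressed by the vertical and horizontal filters."""
    return 1.0 - optimized / greedy


# (greedy, optimized) via counts copied from three rows of Table 1
rows = {
    "adder 2 bits (AOIs)": (31, 8),
    "c432": (122, 18),
    "c1908": (4126, 2124),
}
for name, (greedy, optimized) in rows.items():
    print(f"{name}: {via_reduction(greedy, optimized):.2f}")
```

The three rows print 0.74, 0.85 and 0.49, in agreement with the table.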

This routing procedure can also be used in cell-based approaches, aiming at area
and delay reduction. Compared to a double-layer router, this method guarantees
half the number of tracks. As polysilicon conductivity can be increased using
silicides, it would be interesting to study the resulting number of tracks for
4-layer routing.

The resulting macro-cell has no transparency because all layers are used
for routing. So, at the circuit level it is not possible to route
over macro-cells. One solution is to estimate the number of horizontal and
vertical feedthroughs in each macro-cell at the floorplanning step of the chip.



4. RESULTS 

The layout synthesis tool TROPIC (Moraes, 1994) is used for area and
delay comparisons. This tool uses the linear-matrix layout style, without channel
routing between rows. Horizontal routing is implemented in metal 1, between 
transistors, and the vertical routing in metall. TROPIC was compared to LAS 
(Cadence, 1991), an industrial layout generator, obtaining equivalent values for 
area and delay. TROPIC is therefore the reference for two-metal-layer layout
generation.

Table 2 presents values for area and transistor density for TROPIC and our new
layout style (3 metal). As expected, silicon area was reduced by 20 to 30%. The
average transistor density is 4600 tr/mm² (w=10 µm) and 5500 tr/mm² for
minimal sized transistors. The exception was the ripple carry adder, with an area
reduction of only 10%. To reduce this difference, it is necessary to insert jogs
in the compaction step.



Table 2 Area Comparison

Circuit                      Transistor   Area (mm²)         Density (tr/mm²)   Difference
                             number       TROPIC    3metal   TROPIC    3metal   3metal/TROPIC
adder 2 bits (AOIs)               28      0.0080    0.0057    3500      4912       0.71
adder 2 bits (gates)              40      0.0110    0.0069    3636      5800       0.62
c432                             150      0.0446    0.0282    3363      5319       0.63
alu 4 bits (AOIs)                260      0.0764    0.0633    3403      4106       0.82
alu 4 bits (gates)               424      0.1480    0.1158    2864      3662       0.78
ripple carry adder 12 bits       448      0.1179    0.1068    3800      4195       0.90
carry lookahead 12 bits          528      0.1943    0.1315    2717      4012       0.67

1.0 µm technology - l=1 µm, w=10 µm - Compaction without jog insertion - Cload = 100 fF
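
The density and difference columns of Table 2 follow from the transistor count and the two area columns. The sketch below assumes density is simply transistors divided by area, and difference is the 3-metal area divided by the TROPIC area; both function names are illustrative.

```python
def density(transistors, area_mm2):
    """Transistor density in transistors per square millimetre."""
    return transistors / area_mm2


def area_ratio(area_3metal, area_tropic):
    """Area of the 3-metal layout relative to the TROPIC layout."""
    return area_3metal / area_tropic
```

For the first row, `density(28, 0.0080)` and `area_ratio(0.0057, 0.0080)` reproduce the tabulated 3500 tr/mm² and 0.71 after rounding.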

The macro-cell height is drastically reduced because the track number is
halved and the power/ground rails are implemented over transistors.
However, the macro-cell length tends to increase, due to the large number of
contacts in the “interface line”. To increase the transistor density, it is important
to develop a specific layout compactor, adapted to the proposed layout style. The
existing compactors are generic tools, highly time and memory consuming.

Table 3 presents values for parasitic capacitances (routing and diffusion), delay 
of the critical path and average power consumption. As expected, the sum of 
parasitic capacitances was reduced (28% for alu AOI and 30% for carry 
lookahead). Consequently, delay and power are also reduced. The small difference 
in delay and power (5 to 13%) is due to the transistor sizing. 



Table 3 Delay Comparison (1.0 µm technology - l=1 µm, w=10 µm - Cload = 100 fF)

Circuit                      Transistor   Total Cpar (pF)    Delay (ns)         Power (mW)
                             number       TROPIC   3metal    TROPIC   3metal    TROPIC   3metal
adder 2 bits (AOIs)               28      0.356    0.310      1.85     1.62      0.36     0.34
adder 2 bits (gates)              40      0.708    0.654      1.61     1.51      0.33     0.27
alu 4 bits (AOIs)                260      4.985    3.541      5.18     4.96      2.54     2.35
alu 4 bits (gates)               424      9.800    7.645      7.99     6.42      3.59     3.10
ripple carry adder 12 bits       448      6.115    5.604     16.24    15.51      8.48     8.18
carry lookahead 12 bits          528      9.139    6.360     13.43    12.70      8.53     8.02

In our examples all transistors are sized to 10 µm, minimizing the influence of
parasitic capacitances. To observe the real impact of parasitic capacitances in the
layout style, it is recommended to use minimal sized transistors.



5. CONCLUSION AND FUTURE WORK 

In this paper a new layout style was presented. It minimizes the diffusion
capacitances and the polysilicon length by using three metal layers for routing. As
shown, the sum of parasitic capacitances was reduced, which in turn reduces area,
delay and power. Future work includes:

• Develop a specific layout compactor for the presented layout style. 

• Study the track reduction when using 4 (or n) metal layers for routing. 

• Improve the transmission gate topology in order to avoid wasted area.

• Improve the precision of parasitic capacitance estimation. 

• Take into account the critical path(s) in the placement algorithm. 

• In order to route over the macro-cells, insert feedthroughs to allow crossing 
wires. 










Part Ten Physical Design Issues in Sub-micron Technologies 



REFERENCES 

CADENCE (1991) Virtuoso Layout Synthesizer - LAS - User Guide.
CADENCE™ Version 4.2, October 1991.

Fiduccia,C.M. and Mattheyses,R.M. (1982) A linear time heuristic for improving
network partitions. 19th Design Automation Conference, pp. 175-181.

Fukui,M.; Shinomuya,N. and Akini,T. (1995) A New Layout Synthesis for Leaf
Cell Design. Asia South Pacific DAC, pp. 259-264.

Hwang,C.; Hsieh,Y.; Lin,Y. and Hsu,Y. (1993) An efficient layout style for two-
metal CMOS leaf cells and its automatic synthesis. IEEE Transactions on
CAD, Vol. 12, No. 3, March 1993, pp. 410-423.

Kim,J. and Kang,S.M. (1996) A New Triple-Layer OTC Channel Router. IEEE
Transactions on CAD, Vol. 15, No. 9, September 1996, pp. 1059-1070.

Kim,J.; Kang,S.M. and Sapatnekar,S. (1994) High Performance CMOS
Macromodule Layout Synthesis. ISCAS'94, pp. 179-182.

Kim,S.; Owens,R.M. and Irwin,M.J. (1992) Experiments with a performance
driven module generator. 29th Design Automation Conference, pp. 687-690.

Lin,H.; Chou,C.; Hsu,Y. and Hwang,T. (1994) Cell Height Driven Transistor
Sizing in a Cell Based Module Design. EDAC'94, pp. 425-429.

Lin,H.; Hsu,Y. and Hwang,T. (1996) Cell Height Driven Transistor Sizing in a
Cell Based Static CMOS Module Design. IEEE Journal of Solid-State
Circuits, Vol. 31, No. 5, May 1996, pp. 668-676.

Lopez,A.D. and Law,H.S. (1980) A Dense Gate-Matrix Layout for MOS VLSI.
IEEE Transactions on Electron Devices, Vol. ED-27, No. 8, August 1980, pp.
1671-1675.

Moraes,F.G.; Robert,R.; Auvergne,D. and Reis,R. (1994) An Efficient Layout
Synthesis Approach for CMOS Random Logic Circuits. IX SBMICRO -
Congresso da Sociedade Brasileira de Microeletronica, pp. 48-57.

Reis,A.; Robert,M.; Auvergne,D. and Reis,R. (1995) Associating CMOS
Transistors with BDD Arcs for Technology Mapping. Electronic Letters, Vol.
31, No. 14, July 1995.

Rivest,R.L. and Fiduccia,C.M. (1982) A Greedy channel router. 19th Design
Automation Conference, pp. 208-219.

Terai,M.; Nakajima,K.; Takahashi,K. and Sato,K. (1994) A New Approach to
Over-the-Cell Channel Routing with Three Metal Layers. IEEE Transactions
on CAD, Vol. 13, No. 2, February 1994, pp. 187-200.

Turgis,S.; Azemard,N. and Auvergne,D. (1995) Design and Sizing of Tapered
Buffers for Minimum Power-Delay Product. PATMOS'95, pp. 74-90.

Uehara,T. and Cleemput,W. (1981) Optimal Layout of CMOS Functional Arrays.
IEEE Transactions on Computers, Vol. C-30, No. 5, May 1981, pp. 305-312.



More information: http://www.inf.pucrs.br/~moraes/tropic.html 





35 



Coupled Circuit-Interconnect 
Modeling and Simulation 

Luís Miguel Silveira

Cadence European Laboratories/INESC, Dept. of Electrical and Computer
Engineering, Instituto Superior Técnico

Rua Alves Redol, 9, 136, 1000 Lisboa, Portugal, Tel: 351-1-3100337, Fax: 
351-1-3145843, lms@inesc.pt 

Mattan Kamon 

Research Laboratory of Electronics, Dept. of Electrical Eng. and Comp. Science,
Massachusetts Institute of Technology

77 Massachusetts Ave., Room 36-881, Cambridge, MA 02139, USA, Tel: 
(617)-253-1205, Fax: (617)-258-7864, matt@rle-vlsi.mit.edu 



Abstract 

In this paper we discuss generating low order models for efficient coupled circuit- 
interconnect simulation. The ever increasing speeds and shrinking feature sizes that 
are typical of state-of-the-art integrated circuit designs have made coupling due to
interconnect and packaging a very important, sometimes dominant, factor in system 
performance. The ability to efficiently perform coupled circuit-interconnect simula- 
tion before fabrication is essential in order to detect signal degradation due to delays 
or crosstalk. We first discuss methods of generating models for both two and three 
dimensional interconnect and then present a general, guaranteed-stable, model order 
reduction technique to reduce the order of the interconnect models. 
1 INTRODUCTION

The dense interconnections and packaging structures used in compact electronic 
systems often produce electrical and magnetic interactions which interfere with sys- 
tem performance. With higher speeds, shrinking feature sizes and increasing com- 
plexity, the electrical and magnetic characteristics of such structures are becoming 
increasingly important factors in the behavior and performance of currently fabri- 
cated integrated circuits. Interconnect effects such as ringing, reflection, crosstalk, 
dispersion and attenuation may corrupt logic signals and prevent a circuit from func- 
tioning as specified. Such effects are difficult to simulate because they occur only as
a result of an interaction between the field distribution in a complicated geometry of
conductors, and the circuitry connected to those conductors.

VLSI: Integrated Systems on Silicon R. Reis & L. Claesen (Eds.)
©IFIP 1997 Published by Chapman & Hall

In the design of communication, high-speed digital, and microwave electronic sys- 
tems, long interconnect lines from printed circuit boards and packaging can also have 
an important impact on system performance. The electromagnetic fields produced 
by these lines can be approximated as two-dimensional and for this reason, includ- 
ing non-ideal transmission lines in circuit simulation has become a topic of much 
current research (Lin & Kuh 1992, Celik & Cangellaris 1996). In general, the be- 
havior of stripline and microstrip printed circuit board traces, interchip connections 
on multi-chip modules, and coaxial cable connections are most easily represented by 
frequency-dependent scattering parameters. Since the scattering parameters may be 
derived from measured data, detailed finite-element simulation, or analytic formulas, 
a general approach to including transmission lines in circuit simulators is to allow 
for frequency-dependent elements specified by tables of data. The most straightfor- 
ward approach to including general frequency-domain transmission line models in 
a circuit simulator is to calculate the associated impulse response using an inverse 
fast Fourier transform (Schutt-Aine & Mittra 1988). Then, the response of the line 
at any given time can be determined by convolving the impulse response with an 
excitation waveform. Such an approach is too computationally expensive for use in 
general circuit simulation, as it requires that at every simulator timestep, the impulse 
response be convolved with the entire computed excitation waveform. An alterna- 
tive approach is to approximate the frequency-domain representation with a rational 
function, in which case the associated convolution can be accelerated using a recur- 
sive algorithm (Lin & Kuh 1992). Very efficient circuit simulation programs which 
handle RLCG transmission lines have been developed using such an approach, where 
the rational function approximation was derived using Padé or moment-matching
methods (Bracken, Raghavan & Rohrer 1992). 

For three dimensional structures small compared to a wavelength, electromag- 
netic interactions between conductors can be represented arbitrarily accurately using 
a densely coupled resistor, inductor, and capacitor (RLC) network (Ruehli 1974). 
Although it is possible to simulate coupled circuit-interconnect problems by in- 
cluding this densely coupled RLC network with the transistor models in a circuit 
simulator, this can be a very inefficient approach. In fact, to reasonably accurately 
model complicated interconnect and packaging these RLC models require a very 
large number of energy storage elements, and this will make subsequent circuit 
simulations using the models prohibitively expensive. Impulse response or recur- 
sive convolution techniques could be used, but more recently, model-order reduction 
schemes have been developed to reduce the number of these energy-storage ele- 
ments without compromising the transient response properties of the model (Pillage 
& Rohrer 1990, Bracken et al. 1992, Silveira, Kamon & White 1995, Silveira, Ka- 
mon, Elfadel & White 1996). 

In this paper we discuss model order reduction via the Arnoldi process for ef- 
ficient simulation of both two-dimensional and three-dimensional interconnect. In 
section 2 we will describe some of the techniques used in the accurate modeling 
of interconnect and packaging structures. Then, in section 3 we discuss commonly
used techniques for performing model order reduction on the generated interconnect 
models and describe in some detail the Arnoldi-based model order reduction algo- 
rithm. Next, in section 4 we review the relevant issues that are involved in including 
interconnect models into standard circuit simulators such as SPICE or SPECTRE. In 
section 5 we describe an example that illustrates the techniques described. Finally, 
in section 6, we present conclusions and acknowledgments. 



2 INTERCONNECT AND PACKAGE MODELING 

In this section we describe some state-of-the-art techniques available for gener-
ating models of two and three-dimensional interconnect amenable to model order 
reduction. We start in Section 2.1 by describing one approach for generating mod- 
els from measurements, analytical models, or the output of an electromagnetic tool. 
Then in Section 2.2 we briefly describe the formulation used in a 3D electromagnetic 
modeling tool so that in Section 3 we can describe integrating model order reduction 
directly into an electromagnetic tool. 



2.1 Modeling Transmission Lines 

Efficient modeling of transmission lines involves fitting some appropriate matrix 
or input-output relationship to a rational function. The matrix or input-output rela- 
tionship typically comes from measurement, an analytical model, or an electromag- 
netic tool which provides a scattering formulation, propagation factors, or admit- 
tance matrix over a range of frequencies. First, the ideal mode delays of the line are 
removed and then the rest of the propagation function is approximated by a ratio- 
nal function of an appropriate order. The mode delays can be returned later during 
simulation. 

A numerically robust approach for computing a rational function approximation 
is to use frequency partitioning to reduce the dynamic range of the data. A set of 
local rational function approximations is first computed by linear fitting on each 
partitioned frequency interval. The set of all poles generated from the local approxi- 
mations form an accurate set for a global rational function. A final global fitting can 
then be performed to adjust the residues of the approximation and to enforce correct 
steady-state computation on the model. Unfortunately the linear fitting of the local 
approximates is not guaranteed to produce stable poles. However, due to the par- 
titioning, low order approximations are usually accurate enough in each section to 
avoid unstable poles. Moreover, unstable poles can normally be discarded without an 
impact on accuracy (Silveira, Elfadel, White, Chilukura & Kundert 1994, Nguyen, 
Li & Bai 1996). 
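
The step of discarding unstable poles can be sketched as follows. This is an illustration only: it assumes the rational approximation is held in pole-residue form, which the text does not specify, and uses the usual criterion that a stable pole has a negative real part. The function names are hypothetical.

```python
def evaluate(poles, residues, s, direct=0.0):
    """Evaluate H(s) = direct + sum_i residues[i] / (s - poles[i])."""
    return direct + sum(r / (s - p) for p, r in zip(poles, residues))


def discard_unstable(poles, residues):
    """Keep only the poles in the open left half-plane (Re(p) < 0)."""
    kept = [(p, r) for p, r in zip(poles, residues) if p.real < 0]
    return [p for p, _ in kept], [r for _, r in kept]
```

When the discarded pole carries a small residue, as is typical for a spurious pole produced by local fitting, the response on the imaginary axis changes very little. A final global fit of the residues, as described above, would absorb the remaining error.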

Usually the construction of the global approximation from the local approxima- 
tions introduces redundancies resulting in many more poles than necessary. It is 
therefore desirable to further reduce the order of the approximation. Reducing the 
order can be accomplished by first writing the system in state space form followed
by the model order reduction techniques described in Section 3 (Silveira et al. 1994). 



2.2 Modeling 3D Coupling in Packaging Structures 

A wide range of integrated circuit and packaging design problems require accu- 
rate estimates of the coupling inductances of complicated three-dimensional struc- 
tures. The frequencies of interest generally require magnetoquasistatic analysis, and 
a common approach is to apply finite-difference or finite-element techniques to a dif- 
ferential problem formulation. However, finite-element techniques require that the 
entire 3-D volume be discretized, and for complex structures, the generation of such 
a volume discretization can become cumbersome and require prohibitive execution 
time. Instead, volume-element methods can be applied to solve integral formu-
lations of the problem, in which case only the interior of the conductors need be
discretized. Discretizing only the interior leads to many fewer unknowns than the
finite-difference and finite-element techniques; however, the resulting system is dense
and can be too expensive to solve by direct matrix factorization.

More recently methods for magnetoquasistatic analysis of complicated three-di- 
mensional packages and interconnect have been proposed that allow the modeling of 
extremely complex packaging structures (Kamon, Tsuk & White 1993). These meth- 
ods use a standard volume-element discretization of an integral formulation from 
magnetoquasistatic analysis also known as the Partial Element Equivalent Circuit 
(PEEC) method (Ruehli 1979), but reformulate the discretized equations using mesh 
analysis. The mesh formulation leads to a dense system of equations which is solved 
iteratively using a rapidly converging Krylov-subspace method known as General- 
ized Minimal Residual (GMRES) (Saad & Schultz 1986). Finally, since the system 
of equations is dense, the matrix-vector product required at each iteration of GMRES 
is expensive and to reduce its cost, the fast multipole method is used (Greengard & 
Rokhlin 1987, Nabors & White 1992). The combination of these techniques has 
been implemented in a packaging analysis program, named FastHenry, whose 
computational complexity and memory requirements grow linearly with the number 
of volume-elements required to discretize the conductors. 

In FastHenry each conductor in a t-conductor system is approximated as piece-
wise-straight sections. To model skin and proximity effects, the volume of each
straight section is then discretized into a collection of parallel thin filaments through
which current is assumed to flow uniformly. By applying the Method of Moments to
the magnetoquasistatic equations, one can assign to each filament a resistance, induc-
tance, and mutual inductance to every other filament in the problem.

A system of equations for the filament currents can be derived by assuming that
the applied currents and voltages are sinusoidal, and that the system is in sinu-
soidal steady-state. Following the partial inductance approach in (Ruehli 1972), the
filament current phasors can be related to filament voltage phasors by a branch
impedance matrix, $Z = R + j\omega L$.
The planar graph interconnecting the filaments creates a circuit and, combining this
formulation with Kirchhoff's voltage law via mesh analysis, the following equation
relating the vector of mesh currents and the source branch voltages can be obtained

$$M Z M^T I_m = V_s, \qquad (1)$$

where $M$ is the mesh matrix, $V_s$ is the mostly zero complex vector of source
branch voltages and $I_m$ is the complex vector of mesh currents.

For a specific $\omega$, the complex admittance matrix which describes the external ter-
minal behavior of a t-conductor system, denoted $Y_t$, can be derived from
(1) by noting that $I_t = Y_t V_t$, where $I_t$ and $V_t$ are the terminal source currents and
voltages of the t-conductor system. These values are related to the mesh quantities
by $I_t = N^T I_m$, $V_s = N V_t$, where $N$ is an easily constructed terminal
incidence matrix. Hence, to compute the $i$-th column of $Y_t$, set $V_{t_i} = 1$ and the rest of
$V_t$ to zero. Solve (1) with $V_s = N V_t$ and then extract the entries of $I_m$ associated
with the source branches via $I_t = N^T I_m$.
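
As a small worked instance of this procedure, the sketch below computes the terminal admittance of a single conductor discretized into two parallel filaments. The mesh matrix M and incidence vector N are hand-built for this toy case and are assumptions of the example, not FastHenry output: mesh 1 contains the source and filament 1, mesh 2 is the loop through both filaments.

```python
def solve2(a, rhs):
    """Solve a 2x2 complex linear system by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det
    x1 = (a[0][0] * rhs[1] - rhs[0] * a[1][0]) / det
    return [x0, x1]


def mesh_impedance(M, Z):
    """Form M Z M^T for small dense matrices stored as nested lists."""
    rows, cols = len(M), len(M[0])
    MZ = [[sum(M[i][k] * Z[k][j] for k in range(cols)) for j in range(cols)]
          for i in range(rows)]
    return [[sum(MZ[i][k] * M[j][k] for k in range(cols)) for j in range(rows)]
            for i in range(rows)]


def terminal_admittance(R, L, omega):
    """Y_t for two parallel filaments: solve (M Z M^T) I_m = N V_t with V_t = 1."""
    Z = [[R[i][j] + 1j * omega * L[i][j] for j in range(2)] for i in range(2)]
    M = [[1, 0], [-1, 1]]   # assumed mesh matrix for this toy circuit
    N = [1, 0]              # the source branch belongs to mesh 1 only
    I_m = solve2(mesh_impedance(M, Z), [N[0] * 1.0, N[1] * 1.0])
    return N[0] * I_m[0] + N[1] * I_m[1]   # I_t = N^T I_m
```

For two identical filaments with self impedance z and mutual impedance z_m, the result reduces to 2/(z + z_m), the expected value for two coupled branches in parallel.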

One approach to coupling the above package models with circuits is to include
a sparse tableau version of (1) in a circuit simulator instead of solving for the ter-
minal behavior (Ruehli 1974). This approach has the drawback that the
system in (1) can easily become very large if high accuracy is desired. Another possibility
is to construct and solve (1) for various values of $\omega$ and then use the techniques
described in Section 2.1 to compute a model for the package. However, a more computationally
efficient approach is to form a state-space representation of (1),

$$s (M L M^T) I_m = -(M R M^T) I_m + N V_t, \qquad I_t = N^T I_m, \qquad (2)$$

where $Z$ has been expanded to $R + j\omega L$, and $j\omega$ replaced by $s$. Then model order
reduction techniques as described in the next section can be applied to derive a
smaller approximation.



3 MODEL ORDER REDUCTION 

Since the first papers on asymptotic waveform evaluation (AWE), Padé-based
reduced order models have become standard for improving coupled circuit-interconnect
simulation efficiency. A Padé approximation of order $q$ is defined as a rational
function whose coefficients are selected to match the first $2q-1$ moments of the transfer
function of the system. For a SISO system given in state-space form

$$sAx = x - bu, \qquad y = c^T x, \qquad (3)$$

or in transfer function form $G(s) = c^T (I - sA)^{-1} b = \sum_{k=0}^{\infty} m_k s^k$, the moments
are given as $m_k = c^T A^k b$. Low order Padé approximates can be computed using
direct evaluation of the moments, followed by a moment-matching procedure.
In order to accurately compute higher order Padé approximates, it is necessary to
use successive bi-orthogonalization combined with lookahead, as in the recent
nonsymmetric Lanczos algorithms (Feldmann & Freund 1995). Although nonsymmetric
Lanczos methods plus lookahead can be used to generate Padé approximates of
arbitrarily high order, there is no guarantee that a given approximate will be stable. It
is therefore essential to postprocess the Padé approximate before using it in a circuit
simulation program.

An alternative approach, which robustly generates a somewhat different approxi-
mation, can be derived using an Arnoldi process as in the GMRES algorithm. The idea
behind this approach is similar to that of (Feldmann & Freund 1995), and is that of
selecting an orthonormal basis for the Krylov subspace
$\mathcal{K}_k(A, b) = \mathrm{span}\{b, Ab, A^2 b, \ldots, A^{k-1} b\}$.
The orthogonality between the basis vectors makes the Arnoldi algorithm
a better conditioned process than direct evaluation of the moments.

After $q$ steps, the Arnoldi algorithm returns a set of $q$ orthonormal vectors, as the
columns of a matrix $V_q$, and a $q \times q$ upper Hessenberg (tridiagonal plus upper
triangular) matrix $H_q$. These two matrices are related by
$A V_q = V_q H_q + h_{q+1,q} v_{q+1} e_q^T$, where $v_{q+1}$ is the next Arnoldi vector,
$h_{q+1,q}$ its normalization coefficient and $e_q$ the $q$-th unit vector. From
this relation it can easily be shown that the moments of the transfer function can be
written in terms of $H_q$ and $V_q$. Therefore the $q$-th order Arnoldi-based approximation
to $G(s)$ can be written as

$$G_q^A(s) = \|b\|_2 \, c^T V_q (I - sH_q)^{-1} e_1, \qquad (4)$$

corresponding to the state-space realization $A_q = H_q$, $b_q = e_1$, and $c_q = \|b\|_2 V_q^T c$.
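
A minimal sketch of the Arnoldi process and of the resulting moment matching is given below, in plain Python for a small dense matrix. It is an illustration under simplifying assumptions (modified Gram-Schmidt, no breakdown handling, no block or coordinate-transformed variant) and is not the implementation used by the authors. Moments are taken as $c^T A^k b$, and the reduced model uses $A_q = H_q$, $b_q = e_1$, $c_q = \|b\|_2 V_q^T c$.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def arnoldi(A, b, q):
    """Return (V, H, beta): q orthonormal vectors, q x q Hessenberg H, beta = ||b||_2.

    Assumes no breakdown, i.e. the Krylov subspace has dimension at least q.
    """
    beta = math.sqrt(dot(b, b))
    V = [[bi / beta for bi in b]]
    H = [[0.0] * q for _ in range(q)]
    for j in range(q):
        w = matvec(A, V[j])
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i][j] = dot(V[i], w)
            w = [wk - H[i][j] * vk for wk, vk in zip(w, V[i])]
        if j + 1 < q:
            h = math.sqrt(dot(w, w))
            H[j + 1][j] = h
            V.append([wk / h for wk in w])
    return V, H, beta


def moments_full(A, b, c, count):
    """Moments c^T A^k b of the original system, k = 0 .. count-1."""
    out, v = [], list(b)
    for _ in range(count):
        out.append(dot(c, v))
        v = matvec(A, v)
    return out


def moments_reduced(V, H, beta, c, count):
    """Moments c_q^T H^k e_1 of the reduced model, with c_q = beta * V^T c."""
    q = len(H)
    c_q = [beta * dot(v, c) for v in V]
    out, e = [], [1.0] + [0.0] * (q - 1)
    for _ in range(count):
        out.append(dot(c_q, e))
        e = matvec(H, e)
    return out
```

For a q-step run, the first q moments returned by `moments_reduced` agree with those of `moments_full` up to rounding, whereas a Padé approximant of the same order matches roughly twice as many. This is the trade-off behind the "somewhat different approximation" mentioned above.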

Block extensions to the Arnoldi approximation method, as well as the Lanczos 
methods, can be derived, which allow for direct computation of reduced-order mod- 
els for interconnect or packaging structures with multiple inputs and outputs from 
the general MIMO system in (3). 

For certain classes of RC, RL or LC circuits it has been shown that congru- 
ence transforms, like the Arnoldi algorithm, can generate guaranteed stable and pas- 
sive reduced-order models. Recently a coordinate-transformed Arnoldi algorithm has 
been presented, that can generate arbitrarily accurate and guaranteed stable reduced- 
order models for RLC circuits. Efficient implementation of this algorithm requires 
simple modifications to the standard Arnoldi algorithm and its computational cost is 
roughly equivalent to that of PVL (Silveira et al. 1996).

Application of the Arnoldi technique to the 3D modeling problem is straightfor-
ward from a state-space formulation of the problem. For instance, for the 3D coupling
problem in packaging structures described in Section 2.2, and from Eqn. (2) for the
SISO case, the state-space formulation is readily obtained with
$A = -(M R M^T)^{-1} (M L M^T)$, $b = (M R M^T)^{-1} N_j$ and $c = N_i$, where $N_i$
indicates the $i$-th column of $N$.

Note that the computation of $b$ is inexpensive since $M R M^T$ is sparse. Also,
because $L$ is dense, the dominant cost of each step of an Arnoldi process is a matrix-
vector product, $Ax = -(M R M^T)^{-1} (M L M^T) x$. In practice, the matrix-vector
cost dominates even when the dense part, $(M L M^T) x$, is rapidly computed with a
hierarchical multipole-algorithm as in FastHenry.

A modified version of the standard Arnoldi algorithm which implements the coor- 
dinate-transformation described in (Silveira et al. 1996) can easily be derived. This 
algorithm can be shown to efficiently generate arbitrarily accurate and guaranteed 
stable reduced-order models for this problem. 



4 TIME-DOMAIN SIMULATION 

In order to account for the interconnect or package coupling effects when verifying 
the correctness of a design, it is necessary to include the interconnect model in the 
verification tool. Usually this implies that the generated model must be amenable to 
inclusion in a standard circuit simulation tool such as SPICE (Nagel 1975) or SPEC- 
TRE. 

If the coupling effects are being obtained through some experimental procedure 
or from repeated usage of some detailed analytical tool that generates frequency- 
dependent values, then the standard way to include this data into a simulator is via a 
convolution process at each timepoint. From the frequency-data, Laplace inversion 
or application of the inverse fast Fourier transform can be used to compute a corre- 
sponding impulse response which is then convolved with the excitation waveform. 
This approach is computationally demanding since at every simulator timepoint, 
the impulse response must be convolved with the entire computed excitation wave- 
form. However, if some weak assumptions are made about the time-domain response, 
namely that it becomes “smooth” for large t, then the cost of computing the convo- 
lution integral can be significantly reduced (Kapur, Long & Roychowdhury 1996). 

If the frequency-domain data is approximated by a rational function as described 
in Section 2.1 then a recursive algorithm can be used which reduces the cost of 
computing the convolution integral at each timepoint to a constant factor (Lin & 
Kuh 1992). 
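
The constant cost per timepoint can be illustrated with a minimal sketch. It assumes that the rational approximation is available in pole-residue form, $h(t) = \sum_k r_k e^{p_k t}$ with nonzero poles, and that the excitation is piecewise constant over each timestep. This is a generic illustration of recursive convolution, not the specific algorithm of Lin & Kuh (1992), and the function name is hypothetical.

```python
import numpy as np

def recursive_convolution(poles, residues, x, dt):
    """Convolve h(t) = sum_k r_k exp(p_k t) with a piecewise-constant input x.

    x[n] is the input value on [n*dt, (n+1)*dt). Returns y at t = 0, dt, 2*dt, ...
    Each step updates one state per pole, so the cost per timepoint is constant.
    Assumes every pole is nonzero.
    """
    poles = np.asarray(poles, dtype=complex)
    residues = np.asarray(residues, dtype=complex)
    decay = np.exp(poles * dt)                     # propagates the old state over one step
    gain = residues * (decay - 1.0) / poles        # exact integral over the newest step
    state = np.zeros_like(poles)
    y = np.zeros(len(x) + 1)
    for n, xn in enumerate(x):
        state = decay * state + gain * xn
        y[n + 1] = state.sum().real                # conjugate pole pairs give a real output
    return y
```

By contrast, a direct evaluation of the convolution integral revisits the whole input history at every timepoint.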

A different approach can be applied if the interconnect model is in a state-space 
form. In this case it is possible to directly include the model into the simulator, given 
that the model itself is simply a set of ordinary differential equations, similar to the 
circuit equations themselves. Therefore one can simply enlarge the list of variables 
that the simulator will solve for by adding to it the state vector. Given that the compu- 
tational cost of circuit simulation grows superlinearly with the number of unknowns, 
it is important that the interconnect model be as small as possible. 

Of course, if the generated model is realizable, which is not always the case, then
it is also possible to synthesize an RLC circuit from the state equations. In this case
no modifications need to be introduced to the simulator and the model can be used
directly.

Figure 1 (a) Seven pins of a cerquad pin package. (b) General configuration for
the connection between receiver and driver chips. All the circuit elements inside the
same chip share that chip’s power and ground.

5 EXPERIMENTAL RESULTS 

In the preceding sections, we described algorithms to compute reduced-order models
of interconnect and packaging structures using either Padé or Arnoldi-based
algorithms. In this section we show how such models can be used for coupled
circuit-interconnect simulation. First, the accuracy of the approximations is examined
when used to obtain reduced-order models for the frequency-dependent admittance of a
small set of package pins. The model generated for the packaging structure is then used
in a coupled circuit-interconnect simulation that shows how crosstalk between the pins
of the package can affect the performance and integrity of the signals.

Consider the small set of package pins, as shown in Figure 1-a). To compute the 
resistance and inductance matrices with FastHenry, the pins were discretized into 
three filaments along their height and four along their width producing a system of 
size m = 887. This allows modeling of changes in resistance and inductance due to 
skin and proximity effects. The model to be produced for this seven terminal system 
has seven inputs and seven outputs. 



Coupled circuit-interconnect modeling and simulation 






Figure 2 Bode plot (mutual admittance between pins 1 and 2 versus frequency in
rad/s) for an Arnoldi approximation to the coupled admittance transfer function
between pins 1 and 2. The curves are indistinguishable from the exact solution,
which is also plotted.

5.1 Accuracy Comparisons 

Figure 2 shows the Bode plot of an 8th-order Arnoldi-derived approximation to
the coupled admittance transfer function between pins 1 and 2. Also shown in the
figure is the exact admittance transfer function obtained by eigendecomposition of
the full system. As can be seen from the plot, the approximate model is virtually
indistinguishable from the full transfer function. The same accuracy could be obtained
by constructing an approximant from frequency data points, as described in
Section 2.1 (Silveira et al. 1995). However, this latter technique would be extremely
inefficient in this situation: generating 50 discrete frequency points would require
roughly 30 times the number of Arnoldi iterations needed by the direct state-space
approach described in Section 2.2.



5.2 Coupled Simulation Results 

To investigate the crosstalk effects between the package pins in Fig. 1(a), the
configuration shown in Fig. 1(b) is used, where it is assumed that the five middle lines
carry output signals from the chip and the two outer pins carry power and ground.
The signals are driven and received with CMOS inverters which are capable of driving
a large current to compensate for the impedance of the package pins. The capacitance
is assumed to be 8 pF and the interconnect from the end of the pin to the receiver is
modelled with a capacitance of 5 pF. A 0.1 pF decoupling capacitor is connected between
the driver’s power and ground to minimize supply fluctuations. The frequency
dependence of each element in the admittance matrix is modeled via Arnoldi-based
approximations of 8th order. These models are then incorporated into SPICE3 as a
frequency-dependent voltage-controlled current source (VCCS). As a sample time
domain simulation, imagine that at time $t_0 = 4$ ns the signal on pin 4 of Fig. 1(b) is
to switch from high to low and pins 2, 3, 5, and 6 are to switch from low to high, but
that due to delay on chip, pins 2, 3, 5, and 6 switch at $t_1 = 5$ ns. In this case,
significant current will suddenly pass through the late pins while pin 4 is in transition.
Due to crosstalk, this large current transient has significant effects on the input of the
receiver on pin 4, as shown by the solid line in Fig. 3. Note that the input does not rise
monotonically. The curve also shows that the bump in the waveform is carried
through to the output of the receiver as a large glitch.

Figure 3 Results of the timing simulation for the receiver gate connected to pin 4
(panels: input to receiver of pin 4, output from receiver of pin 4) when the adjacent
pins switch 1 ns after pin 4. Comparison of the waveforms on pin 4’s receiver using a
model constructed from frequency dependent data (full), a constant value from low
frequency (low), and a constant value from high frequency (high).

As a demonstration of the importance of frequency dependent analysis, Fig. 3
also shows the waveforms resulting from simulations using constant resistance and
inductance values corresponding to the high or low frequency limits (the dashed and
dotted waveforms respectively). Note that for the receiver input waveforms, the large
voltage bump swings by approximately 0.5 V more for the full frequency-dependent
case. While this is small on the input, this is a very sensitive region for the receiver
and doubles the size of the output glitch.


6 CONCLUSIONS AND ACKNOWLEDGEMENTS

In this paper we reviewed some of the commonly used techniques for generating 
accurate models for interconnect and packaging coupling, that are amenable to effi- 
cient circuit level simulation. As an example of the capabilities and effectiveness of 
such techniques, an approach for computing reduced-order models of magnetoqua- 
sistatic coupling in complicated 3-D structures was reviewed. This approach shows 
how the combination of an appropriate formulation of the problem with the use of 
robust numerical techniques can lead to models that can be used to produce results 
within acceptable timeframes without compromising accuracy. Numerical examples 
were presented that show how accurate reduced-order models can be generated and 
used in coupled circuit-interconnect simulation.



Abstract

This paper presents the performance of the static timing analyzer TAS for deep
submicronic CMOS technologies. The methodology used by TAS is given with special
emphasis on its Short Channel MOS model. Results are given to show the accuracy
of the static timing analyzer TAS for various CMOS circuits (including pass transistor
and precharge logic) as well as for various CMOS processes ranging from 1.2 µm to
0.35 µm. The Short Channel MOS model of TAS appears to be relevant to the analysis
of deep submicronic processes.

1 INTRODUCTION

The timing analysis problem, on a synchronous sequential logic circuit where each
register is enabled by a single clock, is to find the lowest clock period at which the
circuit operates correctly. This problem is solved in two main phases: a functional
analysis of the sequential behavior (disassembling phase, timing independent)
followed by a timing dependent analysis of the combinational behavior
(Sangiovanni-Vincentelli et al. 1996).

The various types of models describing the timing behavior of CMOS circuits 
demonstrate the compromise between simulation accuracy and simulation expense 
(Wunder et al. 1996). 

Dynamic electrical simulators that solve nonlinear transistor differential equations,
such as SPICE (Nagel 1975) or ELDO (Anacad 1996), are often considered to be
impracticable because of their computation time (Cheng et al. 1992,
Sangiovanni-Vincentelli et al. 1996, Wunder et al. 1996), and yet they are the only
acknowledged reference for the timing specifications of a circuit before foundry.

Static timing analyzers (Ousterhout 1985) are well known to bring a less accurate
but practicable solution to the analysis problem. Their ability to handle much larger
circuits results from the use of a pattern independent approach, together with simpler
models of gate delays. They exploit the unidirectional flow in MOS circuits.

Existing commercial timing analysis tools (Schulz 1995) like Cadence, Compass, 
Chronology, EPIC, Mentor Graphics, Synopsys and ViewLogic use different ap- 
proaches regarding the type of analysis performed (static or dynamic), the input 
netlist (transistor level, gate level or RTL level) and the type of timing model of 
the considered devices (RC model of the transistor, nonlinear equation table, nonlin- 
ear model of the transistor, nonlinear table of gate commutation). Very few perform 
static timing analysis on transistor level netlists with a nonlinear model of the tran- 
sistor (only EPIC does). 

Currently the complexity of large circuits can reach several million transistors. 
Designing such circuits requires the use of decomposition into megablocks of 10k- 
50k transistors followed by connection of the blocks to produce the complete design. 
Like EPIC (Schulz 1995) we assume that transistor level static timing analysis is
performed on each of the megablocks. Then the analysis is performed at higher levels
following the hierarchical approach used for the design. Here we concentrate on the
analysis of each individual megablock.

TAS is a transistor level static timing analyzer for CMOS circuits (Hajjar et al.
1991) that takes a post layout SPICE netlist as input. It brings an original and
accurate solution to the characterization of complex ASICs. The speed-up relies on the
underlying gate structure of any digital CMOS circuit. The transistor netlist is
automatically partitioned into a directed gate netlist. Each gate is then characterized
taking into account signal slopes and reduced to an equivalent inverter. The
computation of the delays uses the original TAS Short Channel MOS (TSCM) model of
the transistor. In order to detect critical paths, TAS makes ‘worst case’ timing
assumptions (Ousterhout 1985). The initial implementation of TAS was developed by
Hajjar et al. (1991). It worked together with the gate abstractor DESB (Greiner et
al. 1992). Its performance was evaluated on CMOS circuits for technologies larger
than 1 µm. TAS now works with another gate abstractor called YAGLE (Lester 1994)
that allows the disassembling of much larger circuits than DESB, as well as special
circuitry. Results are given here to show the accuracy of TAS for various CMOS
circuits (including pass transistor and precharge logic) as well as for various CMOS
processes ranging from 1.2 µm to 0.35 µm. The Short Channel MOS model of TAS
appears to be relevant to the analysis of deep submicronic processes, since most of the
TAS results are within 5% of the SPICE ones.

The paper is organized as follows: Section 2 introduces the main principles of 
TAS. Section 3 validates previous assumptions on a representative set of CMOS 
circuits by comparisons with SPICE simulations. 



2 TAS PRINCIPLES 

TAS works on a SPICE netlist obtained by a classical layout extractor. The analysis
is based on analytical equations that take into account the following factors:

• The current versus voltage characteristics of short channel MOSFETs are modeled
with nonlinear equations.

• The signal slope effects, that is, the commutation time of the input signal
(Auvergne et al. 1990, Dagenais et al. 1992), are computed with an RC model.

• The transient short circuit current during the commutation of a gate (Turgis et al.
1997) is approximated by an additional load capacitance, called here the conflict
capacitance.



Exhaustive analysis of all paths that may be sensitized is performed in order to find 
the resulting worst possible timing behavior. 

TAS works on a post layout transistor level netlist. The first main phase of the
analysis is timing independent. It consists in extracting the functionality of the netlist
and is performed in two steps: the disassembling phase, which builds a gate netlist
(section 2.1), followed by the construction of the graph considering the possible
transitions of signals (section 2.2). The second main phase is timing dependent. It
consists in the computation of gate delays (section 2.3), followed by the determination
of the longest and shortest paths (section 2.4).



2.1 The gate netlist 

From the extracted transistor netlist, a ‘pseudo gate’ netlist is built according to the 
algorithm used in the partitioning tool YAGLE (Lester 1994): 



• Each signal Y driving a transistor gate is the output of a single gate. 

• The gate labeled Y is the collection of electrical paths between Y and the supplies 
(power or ground). 

• There are two kinds of paths: the UP paths from Y to Vdd (power supply) and
the DOWN paths from Y to Vss (ground supply).

• A path is thus made of series connected transistors. Such a path is also called a 
branch. It is similar to the stage structure introduced by Ousterhout (1985) in his 
timing verifier CRYSTAL, for CMOS circuits. 



The transistor orientation is also solved by YAGLE. Therefore a gate is oriented, has 
one output and several inputs and is made of several branches. A first graph is built 
(dependence graph) with false branch elimination (non functional) where : 



• the nodes are the signals; 

• the oriented edges are the dependences between signals (existence of a gate
input-output relationship).



2.2 The causality graph

From this first graph TAS builds the causality graph : 

• A node is one of the two possible events associated to a signal (LOW-HIGH 
transition and HIGH-LOW transition). Hence, each node of the first graph will 
make two nodes in the causality graph. 

• An oriented edge exists between two nodes if the transition of the output node can 
be actually driven by the transition of the input node (for an inverting gate there 
are only two possible edges, for XOR gates there are four edges). 



2.3 The gate delay computation 

From the causality graph, TAS builds the event graph by computing the propagation 
delay between two nodes in two steps : 



1 . First TAS performs a local evaluation of the input slopes. This is done by ignoring 
the coupling effect of the actual switching input and by using a simple RC model 
of the transistors to evaluate locally the worst case commutation time of the gate 
output. To each node i of the event graph is thus associated an intrinsic slope Si, 
depending only on : 

• The geometry of the gate which has node i as output.

• The equivalent capacitive load on node i, Ci that takes into account the drain 
capacitances that can be discharged, the gate capacitances of the transistors 
having i as an input - including those seen through pass transistors -, the post 
layout capacitances of the wires connected to node i. 



2. Second, TAS computes the gate delay associated with each edge in this graph.
For each couple (input, output) TAS creates a set of branches associated with the
output by looking at all the branches found by the disassembling that have the
same input and output. TAS assumes that only a single transistor is turning on
or off at once. This is a popular assumption in the area of delay modeling of gates;
however, studies have begun on the case of multiple inputs switching in close
temporal proximity (Chandramouli et al. 1996). To set the state of the other
transistors belonging to the switching gate, TAS finds the combination of values that
results in the worst possible timing behavior, that is, the worst case assumption used
by Ousterhout (1985) and Dagenais et al. (1992).

Then TAS reduces the branches to a unique inverter - the equivalent inverter -
by analyzing the maximum current that can flow during the switching. The TAS
original Short Channel MOS model (TSCM) introduced by Hajjar et al. (1991) is
used. In this model the drain current of a MOS transistor is expressed as follows:

$$
\begin{aligned}
\text{subthreshold mode } (V_{GS} < V_T):\quad & I = 0 \\
\text{linear mode } (V_{DS} < V_{SAT}):\quad & I = V_{DS}/R_t \\
\text{saturation mode } (V_{DS} > V_{SAT}):\quad & I_{SAT} = \frac{A\,(V_{GS}-V_T)^2}{1 + B\,(V_{GS}-V_T)} \\
\text{saturation voltage}:\quad & V_{SAT} = K\,(V_{GS}-V_T), \quad 0 < K < 1
\end{aligned}
\qquad (1)
$$

where $V_T$ is the threshold voltage, $V_{SAT}$ the saturation voltage, $A$ and $B$ are
the transconductance parameters in the saturation region, and $R_t$ is the resistance
in the linear mode.

The parameters in the TSCM model are generated by running SPICE simulations
on simple fitting circuits, made of simple branches (Dioury et al. 1997). Figures
1 and 2 show the static characteristics of the NMOS transistor for the TSCM
model and the SPICE level 3 model for a 0.8 µm technology. Figures 3 and 4 show
the same for the PMOS. The TSCM model is simpler than the one studied by
Sakurai et al. (1990), yet it is also well fitted to take into account the particular
effects of short channel transistors, such as carrier velocity saturation (Sakurai et al.
1991). The equivalent inverter is then passed to a delay modeler that computes how
long it will take for the output of the branch to change value. The delay modeler uses
the TSCM model to compute the time interval between the time when the input
i reaches $V_{DD}/2$ and the time when the output j reaches $V_{DD}/2$. The delay
$t_{ij}$ depends on:

• The slope on node i, $S_i$.

• The geometry of the gate which has node j as output.

• The equivalent capacitive load on node j, $C_j$. It is modeled like the above
mentioned $C_i$ for slope computation, with addition of the conflict capacitance.

In the following we give more details on computing $t_{ij}$. We approximate signal
waveforms by a hyperbolic tangent function (Hajjar 1992); the input signal on
node i is expressed by:

$$
\begin{aligned}
t < 0 &\;\Rightarrow\; V < V_T \\
t > 0 &\;\Rightarrow\; V = V_T + (U_i - V_T)\,\operatorname{th}(t/S_i)
\end{aligned}
\qquad (2)
$$

where $S_i$ is the slope on node i and $U_i$ is the maximum voltage reached by the
input. Usually it equals $V_{DD}$ except when the input is controlled by a pass
transistor.

Then the high to low delay $t_{ij}^{HL}$ from input i to output j discharging the
equivalent capacitance $C_j$ is computed under the assumption that when the output j
goes from $V_{DD}$ to $V_{DD}/2$ the current $I_{MAX}(t)$ driving $C_j$ can be
approximated using equation 1 - saturation mode - and equation 2, where $A$, $B$,
$S_i$ and $U_i$ are replaced by $A'$, $B'$, $S'_i$ and $U'_i$ to model the series
connected transistors in the switching branch (Hajjar 1992).

The resulting equation is:

$$\int_0^X I_{MAX}(t)\,dt = Q = 0.5\,C_j V_{DD} \qquad (3)$$

When the transistor operates in the saturation mode the drain current $I_{MAX}(t)$
depends only on $V_{GS}$ as in equation 1. Combining equations 1, 2 and 3 gives:

$$\int_0^X \frac{A'\,{U'_i}^2\,\operatorname{th}^2(t/S'_i)}{1 + B'\,U'_i\,\operatorname{th}(t/S'_i)}\,dt = Q = 0.5\,C_j V_{DD} \qquad (4)$$

The delay, which has been defined as the time interval between the instant when
the input i reaches $V_{DD}/2$ and the instant when the output j reaches $V_{DD}/2$, is
derived through an iterative process from the solution $X$ of equation 4 by:



$$t_{ij}^{HL} = X - \operatorname{argth}\left(0.5\,V_{DD}/U'_i\right) \qquad (5)$$

Figure 1 Drain current versus voltage $I(V_{DS})$ for $V_{GS} = 4.5$ V of the TSCM
model (NMOS) compared with the SPICE level 3 model, for a 0.8 µm technology.

Figure 2 Saturation current versus voltage $I(V_{GS})$ of the TSCM model (NMOS)
compared with the SPICE level 3 model, for a 0.8 µm technology.

Figure 3 Drain current versus voltage $I(V_{DS})$ for $V_{GS} = 4.5$ V of the TSCM
model (PMOS) compared with the SPICE level 3 model, for a 0.8 µm technology.

Figure 4 Saturation current versus voltage $I(V_{GS})$ of the TSCM model (PMOS)
compared with the SPICE level 3 model, for a 0.8 µm technology.
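
To make the delay computation of this section concrete, the following is a minimal sketch of the three-region drain current of equation 1 and of a numerical solution of equation 4 for $X$. It is not the TAS implementation: the Simpson integration, the bisection search, the function names and all parameter values are assumptions made for illustration only.

```python
import math

def tscm_current(vgs, vds, vt, a, b, k, rt):
    """Drain current following the three operating modes of equation 1."""
    if vgs < vt:
        return 0.0                                   # subthreshold mode
    vsat = k * (vgs - vt)                            # saturation voltage, 0 < k < 1
    if vds < vsat:
        return vds / rt                              # linear mode
    return a * (vgs - vt) ** 2 / (1.0 + b * (vgs - vt))   # saturation mode

def discharged_charge(x, a, b, u, s, steps=2000):
    """Left-hand side of equation 4 on [0, x], by composite Simpson integration."""
    def current(t):
        th = math.tanh(t / s)
        return a * u * u * th * th / (1.0 + b * u * th)
    h = x / steps                                    # steps must be even
    total = current(0.0) + current(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * current(i * h)
    return total * h / 3.0

def solve_x(c_load, vdd, a, b, u, s):
    """Find X such that the charge removed equals Q = 0.5 * C_j * V_DD (bisection)."""
    target = 0.5 * c_load * vdd
    lo, hi = 0.0, s
    while discharged_charge(hi, a, b, u, s) < target:
        hi *= 2.0                                    # grow the bracket until it contains X
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if discharged_charge(mid, a, b, u, s) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because the integrand is non-negative, the charge grows monotonically with $X$, so a bracketing search is sufficient. For $B' = 0$ the integral has the closed form $A' {U'}^2 (X - S' \operatorname{th}(X/S'))$, which gives a convenient check of the numerical solution.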



2.4 Static timing analysis 



TAS analyses the event graph to give the worst case delay between all circuit ter- 
minals (input connector, output connector or internal register). It is done by using 
a backward breadth-first algorithm to find both the longest and the shortest paths 
between two terminals. 
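
A minimal sketch of the longest and shortest path computation on an acyclic event graph is given below. Note that it uses a forward pass in topological order rather than the backward breadth-first traversal used by TAS, and that the edge-list representation and the function name are assumptions made for this example.

```python
from collections import defaultdict, deque

def path_delays(edges, source):
    """Longest and shortest delays from source to every reachable node of a DAG.

    edges is a list of (from_node, to_node, delay) triples.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm: visit a node only after all of its predecessors.
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("the event graph contains a cycle")
    longest = {source: 0.0}
    shortest = {source: 0.0}
    for u in order:
        if u not in longest:
            continue                                 # not reachable from source
        for v, d in succ[u]:
            longest[v] = max(longest.get(v, float("-inf")), longest[u] + d)
            shortest[v] = min(shortest.get(v, float("inf")), shortest[u] + d)
    return longest, shortest
```

Both extremes are obtained in a single pass whose cost is linear in the number of edges, which is what makes a pattern independent analysis tractable on large blocks.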




Figure 5 Flow graph of the timing analysis 



The design of multi-million transistor chips uses decomposition, which consists
of breaking the design at a given level of hierarchy into components that can be
designed and verified almost independently. No timing analyzer can handle very
complex circuits without the use of decomposition. The aim of the analysis made
here is to handle the leaf block level of the hierarchy, to be used by a hierarchical
analyzer later (see Figure 5). In order to be fast enough we have chosen not to perform
a priori false path detection (Perremans et al. 1989, Benkoski et al. 1990), but to
leave this detection for the detailed analysis of single paths if needed.



3 RESULTS 

To illustrate the ability of TAS to handle several types of circuitry and several
processes, timing analysis has been performed on the 5 test circuits presented in Table 1.
These circuits were chosen to emphasize particular difficulties of transistor level
timing analysis such as pass transistor and precharge domino logic. They have been
designed using the symbolic approach and the ALLIANCE CAD system for VLSI
design (Bazargan-Sabet et al. 1994). The physical layouts corresponding to five
different CMOS processes (see Figures 6 and 7) have been generated and analyzed by
both TAS and SPICE.




Figure 6 Layout of an inverter for a 1.2 µm technology

Figure 7 Layout of the same inverter for a 0.35 µm technology



To estimate the accuracy of TAS, SPICE and TAS are executed on the same circuits
and comparisons are made on computation times and computed propagation times.
To be able to compare the propagation times resulting from SPICE and TAS,
we have to find the input patterns that will activate the critical path in the
dynamic electrical simulator. This is the reason why we have chosen small enough
circuits.

Table 1 Description of the circuit benches for comparison of TAS versus SPICE

Circuit (number    Function            Size of operands    Main feature of the design
of transistors)                        (number of bits)    (reference)

alu (434)          arithmetical        4                   DUAL CMOS (AMD 1977)
                   logical unit
rsa (117)          fast adder          4                   xor with bleeder (Lucas et al. 1993)
grog (368)         rom                 64 words * 1 bit    precharge domino logic (Greiner et al. 1994)
amg (1114)         multiplier          6                   complex CMOS cells (Royannez et al. 1994)
bsg (270)          barrel shifter      4                   N MOS pass transistors (Bazargan-Sabet et al. 1994)

The propagation delays resulting from the execution of TAS and from H-SPICE
(by Meta-Software) simulation are shown in Table 2. The error in TAS is defined
as:

$$\text{error} = 100\,\frac{t_{TAS} - t_{SPICE}}{t_{TAS}} \qquad (6)$$
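
As a small worked check of this definition, the sketch below evaluates the error for two rows of Table 2 (alu at 0.35 µm and amg at 1.0 µm). The function name is hypothetical; the delay values are those listed in the table.

```python
def tas_error(t_spice, t_tas):
    """Relative error of TAS in percent, normalized by the TAS delay (equation 6)."""
    return 100.0 * (t_tas - t_spice) / t_tas

# alu, 0.35 um: SPICE 3.12 ns, TAS 3.11 ns
alu_error = tas_error(3.12, 3.11)
# amg, 1.0 um: SPICE 9.96 ns, TAS 11.1 ns
amg_error = tas_error(9.96, 11.1)
```

A positive value means that TAS predicts a longer critical path than SPICE, that is, a pessimistic estimate.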



In most of the cases, the results of TAS are within 7 percent of those of SPICE. This
shows that the TSCM model is relevant for a large range of deep submicronic
technologies. The worst results are obtained with the multiplier. Yet the multiplier ‘amg’
includes some complex CMOS cells such as:

a or (b and c) or ((not a) and (not b) and (not c))    (7)

TAS is pessimistic for complex cells because, for a given input to output delay, it
takes the worst case assumption regarding the other input patterns.

TAS is much faster than SPICE. For example, simulation of the multiplier ‘amg’
takes 30 minutes with SPICE and 2 seconds with TAS.

TAS has also been successfully used on high complexity circuits. The largest
example studied with TAS, with a decomposition approach, is AURIGA2 from BULL.
AURIGA2 is a control processor unit on one chip, made of 4.7 million transistors
in a CMOS 0.5 µm 3 metal technology. The most interesting result is that TAS has
been able to classify the electrical paths of the chip with respect to commutation time.

Table 2 Critical paths (ns) of various circuits versus technology feature size

Circuit    Technology           Voltage       SPICE    TAS      Error
           feature size (µm)    supply (V)    (ns)     (ns)     (%)

alu        0.35                 3.0           3.12     3.11     -0.3
           0.5                  3.0           7.35     7.59     +3.0
           0.8                  4.5           8.58     8.58     0
           1.0                  4.5           9.80     10.3     +4.5
           1.2                  4.5           16.2     15.8     -2.0

rsa        0.35                 3.0           1.7      1.8      +5.5
           0.5                  3.0           3.76     4.02     +6.0
           0.8                  4.5           4.33     4.41     +2.0
           1.0                  4.5           5.02     5.50     +8.5
           1.2                  4.5           8.15     8.29     +1.5

amg        0.35                 3.0           3.45     3.78     +8.5
           0.5                  3.0           7.75     8.48     +8.6
           0.8                  4.5           8.83     9.20     +4.0
           1.0                  4.5           9.96     11.1     +10.0
           1.2                  4.5           17.0     18.0     +5.5

bsg        0.35                 3.0           9.9      10.1     +2
           0.5                  3.0           2.01     1.95     -3.0
           0.8                  4.5           2.32     2.17     -7.0
           1.0                  4.5           2.44     2.54     +4.0
           1.2                  4.5           3.76     3.72     -1.0

grog       0.35                 3.0           1.89     2.04     +7.0
           0.5                  3.0           3.50     3.70     +5.0
           0.8                  4.5           4.14     3.96     -4.5
           1.0                  4.5           4.87     5.12     +5.0
           1.2                  4.5           7.89     7.73     -2.0



4 CONCLUSION AND FUTURE WORK 

TAS is a static timing analyzer that gives accurate timing information for high
complexity ASICs. The analysis achieved by TAS with the TSCM model is relevant for
optimized design using a wide range of submicronic technologies. It works
successfully on the leaf cell level of the hierarchy of high complexity circuits (millions of
transistors).

Our efforts now concern the hierarchical analysis, as shown in Figure 5, taking into
account the resistances of interconnections.



Abstract

This paper presents the design and implementation of a time driven adder generator
architecture. There exists a large variety of adders designed to satisfy different
computation requirements; in particular we list the Carry Look Ahead (CLA) adder,
the skip adder, the ripple adder, the carry select adder (CSA), etc. These different
architectures offer different delays and it is up to the user to choose among
them. The design we present here allows the parametrization of the architecture to
fit one's design constraints. From the word length and the wanted delay the
generator outputs a suitable architecture.

Keywords 

Addition VLSI architectures, generators, macro blocks, variable architecture. 



INTRODUCTION 

There exists such a large set of different adder generators (Sklansky, 1990), (Bedrij,
1962), (Brent, 1982), (Cavanagh, 1984), (Hwang, 1979), (Muller, 1989), each one
implementing a particular architecture, that the choice of an adder may be tough.
We present here an alternative which will replace all the others. We impose the
time delay criterion on the generator and it outputs the right adder (the
architecture of the adder is thus variable).

Moreover, the need for such a generator is justified by the optimization of
electrical power consumption and area. In fact, we find addition units in almost all
complex circuits. For example, in a general purpose processor, certain adders are
allocated for address computation, while others are integrated in floating point units
and most of the time linked to multipliers. The applications are various and the
constraints (computation time, area, power consumption) vary from one application to
another. However, we find almost the same type of adder (the fastest) in such
designs, even though this is not always necessary. For example, in address
computations, we have one clock cycle to carry out the operation; we can thus relax
the computation delay requirement and use a slower adder. This allows a certain
gain in power consumption and silicon area. The generators we designed allow us
to fit exactly our performance requirements with the best optimizations possible.

From the word lengths of the two operands and the computation time, the
generator outputs an adder in four different views (structural, behavioral, physical
and placement). It also outputs a certain number of functional patterns. This is
illustrated in figure 1.




Figure 1 Generator output 

The generator is designed following a methodology developed at the MASI 
laboratory (Aberbour, 1995), (Houelle, 1994), and directly inherited from the 
silicon compiler approach (Johansen, 1979). 

This paper is composed of three parts. First of all, we review the different types of 
addition architectures (ripple, CLA, skip adder, ...). Then we select an architecture 
suitable for the desired delay. Finally we sum up with different VLSI results for a
32-bit adder.



PRINCIPLE OF THE VARIABLE ARCHITECTURE ADDER

All known adders consist of a tree of cells computing the generation (G)
and propagation (P) signals in order to determine the values of the intermediate
carries. The variable architecture only affects the way this tree is constructed.

The architecture of the tree can in fact take several forms, more or less parallel.
To illustrate this we propose an example of an 8-bit adder in three configurations,
as depicted in figures 2(a), 2(b), 2(c).

Figure 2 Alternative adder architectures: (a) ripple adder, (b) CLA adder,
(c) variable architecture adder

The first configuration, figure 2(a), computes the propagation and generation 
values in a serial fashion and represents a full sequential adder. 

This adder contains eight cells and its computation delay in the one of eight 
combinational stages. 

The carry anticipation adder, figure 2(b), uses a binary tree to compute the P and 
G signals. Its delay evolves logarithmically (Sklansky, 1990), three stages in our 
case. 
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The last configuration, figure 2(c), represents an intermediate adder between the 
full serial adder and the carry anticipation adder. It is built up of ten logic cells and 
presents a delay of four stages. 

The variable architecture adder generator is thus capable of generating a tree 
containing at the same time a parallel section and a serial section. 

For an N-bit adder the generator outputs an operator with a number of stages 
varying from log2(N) to N. 
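
The range of stage counts quoted above can be illustrated with a small sketch. It assumes that the fully parallel tree needs ceil(log2(N)) stages and the fully serial adder needs N stages; the function name `stage_bounds` is ours and is not part of the generator.

```python
def stage_bounds(n_bits: int) -> tuple[int, int]:
    """Return (fewest, most) combinational stages for an n_bits-wide adder."""
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2")
    # ceil(log2(n)) without floating point: bit length of n - 1.
    fewest = (n_bits - 1).bit_length()  # full binary tree (CLA-like)
    most = n_bits                       # fully serial (ripple)
    return fewest, most


print(stage_bounds(8))   # (3, 8), matching figures 2(b) and 2(a)
print(stage_bounds(32))  # (5, 32)
```

Any number of stages between these two bounds corresponds to one hybrid configuration of the tree.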



THE MAIN ADDER ARCHITECTURES 

We start the discussion with the ripple adder. It is the slowest (delay in O(N)), but 
it occupies the smallest area (in O(N)). It is used where a very small area and power 
consumption are needed. 

Opposite to the ripple adder is the Carry Look Ahead adder. It has a 
computation time of the order of O(log2(N)) and an area in O(N*log2(N)), 
which makes this adder the largest in terms of size. It is readily built in a recursive 
fashion, and this makes it suitable for an implementation as a generator. 

There exists a large set of architectures with intermediate characteristics. A skip 
adder architecture offers still better performance. 

We also find in the literature the adder with carry selection (Bedrij, 1962). It is 
broken down into several blocks. Each block carries out two additions in parallel, 
one anticipating a carry in of 0 and the other a carry in of 1. The result is then selected 
depending on the true value of the carry in. The delay is in $O(\sqrt{2N})$; the area 
is also in $O(N\sqrt{N})$. 
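
The carry selection principle can be sketched behaviourally as follows. This is only an illustration on Python integers, not a gate-level model; the function name `carry_select_add` and the uniform block size are assumptions made for the example.

```python
def carry_select_add(a: int, b: int, n: int, block: int) -> int:
    """Add two n-bit unsigned integers block by block, carry-select style."""
    result, carry = 0, 0
    for lo in range(0, n, block):
        width = min(block, n - lo)
        mask = (1 << width) - 1
        xa, xb = (a >> lo) & mask, (b >> lo) & mask
        # Both candidate sums are formed before the real carry in is known.
        sum_if_0 = xa + xb        # anticipates a carry in of 0
        sum_if_1 = xa + xb + 1    # anticipates a carry in of 1
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << lo
        carry = chosen >> width   # carry out selects the next block's result
    return result | (carry << n)


print(carry_select_add(200, 100, 8, 3))  # 300
```

In hardware the two candidate sums of every block are computed in parallel, so only the selection chain lies on the critical path.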

THE CHOSEN ARCHITECTURE 

The addition architecture used was introduced by (Sklansky, 1990). Its operator 
$\Delta$ allows the simple computation of the carry propagation and generation functions 
$P_i^j$, $G_i^j$, starting from position $i$ up to position $j$, written $PG_i^j = (P_i^j, G_i^j)$. 
The properties of this operator are listed below. 

• Associativity 

$PG_i^j = (PG_k^j \,\Delta\, PG_l^{k-1}) \,\Delta\, PG_i^{l-1} = PG_k^j \,\Delta\, (PG_l^{k-1} \,\Delta\, PG_i^{l-1})$ for $j \ge k > l > i$ (1) 

• Idempotence 

$PG_i^j = PG_k^j \,\Delta\, PG_i^l$ for $j \ge l \ge k > i$ (2) 

• Non Commutativity 

$PG_i^j \ne PG_i^{k-1} \,\Delta\, PG_k^j$ for $j \ge k > i$ (3) 

The most important property is the way the intermediate propagation and 
generation functions are computed 

$PG_i^j = PG_k^j \,\Delta\, PG_i^{k-1}$ (4) 

which corresponds to 

$(P_i^j, G_i^j) = (P_k^j \cdot P_i^{k-1},\; G_k^j + P_k^j \cdot G_i^{k-1})$ for $j \ge k > i \ge 0$ (5) 
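
As a hedged illustration of the operator in equation (5), the sketch below applies it serially (ripple fashion) to compute all carries. The per-bit definitions P = a XOR b and G = a AND b, a carry in of 0, and the names `pg_combine` and `add_with_pg` are assumptions of this example, not taken from the paper.

```python
def pg_combine(hi, lo):
    """Combine the (P, G) pair of a block with that of the adjacent lower block."""
    p_hi, g_hi = hi
    p_lo, g_lo = lo
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))


def add_with_pg(a: int, b: int, n: int) -> int:
    """n-bit addition where the carry into bit i+1 is the G of bits 0..i."""
    result, carry, acc = 0, 0, None
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p, g = ai ^ bi, ai & bi                 # per-bit propagate and generate
        result |= (p ^ carry) << i              # sum bit uses the incoming carry
        acc = (p, g) if acc is None else pg_combine((p, g), acc)
        carry = acc[1]                          # group generate of bits 0..i
    return result | (carry << n)


print(add_with_pg(0b1011, 0b0110, 4))  # 17
```

Because the operator is associative, the same pairs can be combined in a tree instead of a chain; this is the freedom the variable architecture exploits.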



Figure 3 The configurable adder architecture (CLA section on the MSB side, 
ripple section on the LSB side) 

The chosen architecture must be adjustable to the imposed computation 
time. This means that the length of the critical path must vary from one configuration 
to another. Moreover, since the generator must be able to generate adders with a 
propagation delay between the CLA delay and the ripple delay, 
the implemented architecture is a hybrid between the CLA (the fastest) and 
ripple (the slowest) architectures. Now suppose that the delay constraint forces us 
to build an adder in which the critical path consists of k cells; the adder 
will then be as shown in figure 3. 

We notice three distinct parts. Let n be the bit width of the adder and k the 
number of stages (k is also the number of cells of the ripple part of the 
adder). 

The first group of x cells in parallel is placed between the positions (n-x,k) and 
(n-1,k); x will be determined later on. The inputs to these cells are 

$\forall i,\; n-1 \ge i \ge n-x$: ... and ... (6) 

The first values PG are generated by the group of cells situated at the 
top of the previous group, precisely from the point (n+1-x,k-x+1) to the point 
(n-1,k-1). 

It is then sufficient to specify the value of x and we get a basis to build the adder. 
We have seen that cells are placed from the point (1,1) to the point (n-x-1,k-1) 
included. Since these cells are cascaded, i.e. they lie on a diagonal, 

k-1 = n-x-1 which yields x = n-k (7) 

Unfortunately, there exists a limiting case which restricts the application domain of 
the algorithm. The cells of the second group start from position (n+1-x,k-x+1), 
where k-x+1 is the reference number of the stage, and the highest stage 
is numbered 1. We conclude that 

k-x+1 ≥ 1; with x = n-k we get k ≥ n/2, or n ≤ 2k (8) 

However, it can happen that this inequality is violated. In this case we apply the 
same process as to build a CLA adder. This means that we instantiate n/2 cells 
from (n/2+1,k) to (n-1,k), then we elaborate two adders of n/2 bits starting from the 
stage referenced by the number k-1. 
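
The decision between the direct construction and the CLA-style split can be sketched as below. This is a minimal sketch assuming relations (7) and (8) read x = n-k and k ≥ n/2, and that n is even whenever a split is required; the function `plan_adder` and its tuple output are hypothetical and only describe the plan, they do not build cells.

```python
def plan_adder(n: int, k: int):
    """Describe how an n-bit adder with k stages would be decomposed."""
    if k < 1 or k > n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if 2 * k >= n:
        # Relation (8) holds: x = n - k parallel cells, the rest in ripple.
        return ("hybrid", n, k, n - k)
    if n % 2:
        raise ValueError("split case assumes an even width")
    # Relation (8) violated: one CLA-like level at stage k, then two
    # identical n/2-bit adders built from stage k - 1 (described once).
    return ("split", n, k, plan_adder(n // 2, k - 1))


print(plan_adder(32, 20))  # ('hybrid', 32, 20, 12)
print(plan_adder(32, 5))
```

For n = 32 and k = 5 the plan splits three times before reaching a 4-bit block that satisfies relation (8).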

RESULTS 

Three different VLSI comparisons have been carried out on a 32-bit adder, 
for each configuration, that is, for every value of k from 5 to 31. For 
these tests we used the ECPD07 cell library of the ATMEL-ES2 company. 

First of all, we focused our comparisons on the routed circuit area. The automatic 
placement and routing were done with the CADENCE tools. 




Figure 4 Area of the routed circuit (area versus number of stages) 

At first glance, the curve in figure 4 does not seem meaningful. 
However, we can distinguish two intervals. In the first, for a number of stages less 
than 16 (=n/2), the area gain is significant and the curve is steep. Elsewhere, the 
curve is flat and does not offer an advantageous zone in which to find the best 
area-delay compromise for the adder. The curve indicates that the area decays 
exponentially as the architecture tends to become fully serial. 

Now let us focus on the propagation time results for the technology used, illustrated 
in figure 5. 

The curve is quasi-ideal because the delay grows linearly with respect to the 
number of stages, as expected. The measured delays for a 32-bit adder vary from 
9 to 30 nanoseconds. 

Figure 5 Propagation delay comparisons 

Using the power consumption values of each cell, provided by ES2, we establish 
the adder's maximal power consumption, which corresponds to the case where all 
input cells toggle at the same time, a case that is in fact almost impossible. This is 
illustrated in figure 6. 




Figure 6 Electrical power consumption 

The obtained curve is smooth and grows exponentially. Once more we can 
extract two distinct intervals: 

A sharp and fast growth part for a number of stages less than 16 (=n/2). 

And an almost straight line, representing a not really interesting power 
consumption-delay compromise. 

CONCLUSION 

The main goal achieved in this work is the replacement of all possible adder 
generators by a generator with a parametrized, time-driven addition architecture. 
This is possible only if we impose the addition computation time on the 
generator. In fact, the generator provides very good results since we can adjust 
the computation time very precisely, by using different numbers of intermediate 
combinatorial stages. 

The study of the curves representing the performances of the adders leads to 
the conclusion that the compromise is optimal for a number of stages less than n/2, 
where n is the precision of the adder. In fact, the area and power consumption 
decrease very rapidly in this interval, whereas the delay grows slowly. This means 
that if we can tolerate increasing the computation time of the addition by about 
10%, which is equivalent to increasing the number of stages by a few units, then 
we can achieve a power consumption and area gain of about 15%. Outside this 
interval (>n/2), the area and power decrease slowly, and this presents a negligible 
profit. 

Abstract 

This paper presents a High Level Synthesis (HLS) method for specialized coprocessors 
in embedded systems. In recent years, the synthesis of hardware systems has 
moved to a higher level of abstraction, but the existing tools leave very little initiative 
to the designer. With User Guided High Level Synthesis (UGH), we introduce 
the notion of Draft Data-Path Scheme (DDPS) which we consider an efficient way 
for the user to guide the HLS process. It describes the general structure of the data-path, 
without detailed information like signal-widths or physical implementation of 
multiplexers. Guided by these structural constraints, UGH intends to deliver a full 
data-path and a scheduled Finite State Machine that takes into account the detailed 
timing characteristics of the target physical library. 



Keywords 

HLS, Synthesis, Scheduling, Petri-Net, Co-Design, VHDL 



1 INTRODUCTION 

Several high-level synthesis systems have been proposed in the last decade. The input 
description is generally a procedural description, using a Hardware Description Language 
such as VHDL, and the output is a Register-Transfer Level architecture. Major 
steps include the operation scheduling and the functional resource allocation. However, 
despite the need to increase the design productivity, these tools are not totally 
accepted in the industry. One reason may be sought in the scheduling strategy (Paulin 
et al. 1989, Gajski 1992, Biesenack et al. 1993, Bergamaschi et al. 1993, Rahmouni 
et al. 1994) used by the existing tools. The major purpose of scheduling is to define 
the best possible serial/parallel tradeoff with respect to user-required performance 
and size constraints. Time and area are the two main optimization criteria. To solve 
this complex problem, the scheduling algorithms make the following assumptions: 

1. They rely on a characterized library of physical (or virtual) operators. For each operator, 
a static table gives the area and timing characteristics. The characterization 
of the latency of the operations is most often limited to the delay of the physical 
operator. In particular, the delays induced by the physical connections and register 
characteristics are seldom taken into account. This issue becomes critical in 
submicron technology (Ramachandran et al. 1992, Chaiyakul et al. 1991). 

2. The constraints given by the designer to explore the design space are global constraints 
such as defining the total number of busses, registers or arithmetic operators. 
Such constraints are too loose to let the user precisely guide the HLS tool. 

3. There are two ways to synthesize the conditional instructions. They can be implemented 
in firmware, in the control FSM, or they can be hard-wired by multiplexers 
in the data-path. Tools usually either micro-program them all, or hard-wire them 
all. The firmware approach is simpler, but generates a Mealy FSM with a complex 
output logic. It induces delays that are not properly taken into account by the 
schedulers, as opposed to Moore FSMs. 

Existing tools cope with some of those problems. For conditional instructions, 
CALLAS (Biesenack et al. 1993) allows the user to choose the conditions to be hard-wired, 
by direct specification of basic block boundaries. CATHEDRAL (Lanneer et 
al. 1990) and GAUT (Martin et al. 1990) remedy the second issue using directives 
included in the behavioral description that allow the designer to guide the HLS tool 
so precisely that it can generate the operative part prior to scheduling. To integrate 
the precise timing characteristics of the data-path into the scheduling, ALMA (Augé 
et al. 1995, Brunei 1996) requires an explicit description of the data-path structure. 
This allows the scheduler to take into account not only the data dependencies, but 
also the exact physical delays and setup/hold times extracted from the layout. 

The User Guided High level synthesis (UGH) approach described in this paper 
tries to merge the three solutions presented above into a single tool. The designer 
specifies the data-path structure, using the Draft Data-Path Scheme (DDPS), which 
is a synthetic description of the target coprocessor architecture. Section 2 presents 
the general overview of the synthesis flow, and defines the supported VHDL input. In 
section 3 the proposed approach is illustrated on the classical example of the Highest 
Common Factor (HCF). The final section concludes with some perspectives. 



2 GENERAL OVERVIEW 

2.1 The Co-Simulation and Co-Synthesis Environment 

The User Guided High level synthesis tool presented here is part of the COSYS co-design 
environment for embedded systems, developed at UPMC. In complex core-based 
embedded systems, where several processors or dedicated coprocessors have 
to communicate through a shared memory, the system level communication scheme 
is the critical point. We don’t want to re-invent a new communication protocol for 
each new design, and we don’t want to synthesize the complex bus controllers that 
can be found in optimized libraries. So we defined a generic architecture (Figure 1.a) 
around the PIBUS (OMI 1996). Thus, the first component in COSYS is a library of 
PIBUS compliant modules: parameterized hardware bus controllers (in master and 
slave mode shown on Figure 1.b), the corresponding software drivers, a MIPS R3000 
microprocessor core and several reusable macro-cells such as an interrupt controller, 
a PI/PCI bridge, ... 

Figure 1 Generic Target Architecture. 

Figure 2 Main Scheme of UGH. 

The second component in COSYS is the high speed simulation environment for 
hardware/software co-simulation. PISIM (Pétrot et al. 1997) is a cycle-based simulator 
that can simulate more than 100 000 MIPS instructions per second. The hardware 
components in the system can be described in C or VHDL, as long as they behave as 
synchronous FSMs. The cycle-precise PISIM simulator is used for a reliable perfor- 
mance evaluation of several possible architectural solutions in the hardware/software 
system-level partitioning along with functional validation. The third component is 
the high level synthesis tool UGH, presented in the remainder of the paper. 

MODEL HCF (IN DIN; OUT DOUT) 
{ 
    DFF RA, RB; 
    SUB a; 
    EQ eq; 

    a.A = RA.Q, RB.Q; 
    a.B = RA.Q, RB.Q; 
    eq.I = a.S; 
    dout = a.S; 

    RA.D = a.S, DIN; 
    RB.D = a.S, DIN; 
} 

(a) File Sources 

(b) Schematic Draft 

Figure 3 Draft Data-Path Scheme of HCF. 



2.2 The Co-processor Generic Architecture 

The synthesis flow is shown in Figure 2. The generic architectural model of a 
control-dominated coprocessor is a data-path controlled by a Finite State Machine. 
The data path contains registers, local memory and logic or arithmetic operators that 
are interconnected by busses or multiplexers. All these functional resources belong 
to a library of virtual, generic, operators that must be mapped to the physical cell 
library available for the target fabrication process. 

The User Input 

The main input is the behavioral VHDL description of the co-processor. It is a 
VHDL entity with a well defined interface, that generally communicates with a 
predefined bus controller (taken from the PIBUS module library) through a simple 
READ/WRITE protocol. This makes the coprocessor’s behavior fully independent of 
the complex multi-master PIBUS protocol. The VHDL description must be a single 
synchronous VHDL process; the only signal in the VHDL WAIT statements is the 
system clock. All external communications must be synchronized using the VHDL 
WAIT statement. The internal computation performed by the coprocessor can be a 
fully procedural description, including loops and conditional instructions, without 
any reference to the system clock. From a simulation point of view, it means that 
all the internal computation between two external input/output accesses takes only 
one simulated system clock cycle. At this stage, simulation cannot give any reli- 
able performance evaluation for the embedded system. The goal is only a functional 
validation of the coprocessor specification in the system environment. However, the 
designer can use an arbitrary number of WAIT statements to indicate an estimated 
number of cycles needed to perform the algorithmic calculation in order to have a 
better idea of global system performances. 

Once the VHDL model has been validated, the designer must provide a simpli- 
fied structural description of the target data-path called ‘Draft Data-Path Scheme’ 
(DDPS) to proceed further to the synthesis. It contains all physical registers, and 
all functional operators, but it is not necessary to describe explicitly multiplexers or 
tri state busses, and there is no detailed information about signal widths. All regis- 
ters instantiated in the DDPS must correspond to variables in the VHDL process. 

entity HCF is 
  port (CK, FULL, EMPTY : in bit; 
        DIN  : in integer; 
        DOUT : out integer; 
        REQ  : out bit_vector(1 downto 0)); 
end HCF; 

architecture UGH of HCF is 
  CONSTANT READ  : bit_vector (1 downto 0) := "01"; 
  CONSTANT WRITE : bit_vector (1 downto 0) := "10"; 
  CONSTANT NOP   : bit_vector (1 downto 0) := "00"; 
BEGIN 
  HCF : PROCESS 
    VARIABLE RA, RB : integer; 
    VARIABLE NUL, INF : boolean; 
  BEGIN 
    REQ <= READ;   -- reading first operand 
    WAIT on CK until ( CK = '1' and EMPTY = '0' ); 
    RA := DIN; 
    REQ <= READ;   -- reading second operand 
    WAIT on CK until ( CK = '1' and EMPTY = '0' ); 
    RB := DIN; 
    REQ <= NOP;    -- calculation of HCF 
    NUL := ( RA = RB ); 
    WHILE ( NOT NUL ) 
    LOOP 
      INF := ( RA < RB ); 
      IF ( INF ) THEN 
        RB := RB - RA; 
      ELSE 
        RA := RA - RB; 
      END IF; 
      NUL := ( RA = RB ); 
    END LOOP; 
    REQ <= WRITE;  -- writing the result 
    DOUT <= RA; 
    WAIT on CK until ( CK = '1' and FULL = '0' ); 
  END PROCESS; 
END UGH; 

(a) VHDL Description 




Figure 4 Highest Common Factor. 



and they must have the same name. The opposite is not true: it is possible to use in 
VHDL description variables that will not be synthesized as registers. It allows the 
designer to control the allocation and binding of the behavioral VHDL instructions. 



Architectural Refinement and Coarse Scheduling 

This first synthesis step is fabrication process- and physical cell library-independent. 
UGH performs the architectural refinement and binding, generating a detailed data- 
path, as well as a coarse scheduling, generating a first FSM controller. 

The detailed data-path is a complete structural VHDL net-list, containing all necessary 
multiplexers, and control signals driven by the FSM. The signal widths are 
derived from the behavioral VHDL description. At this stage, the instantiated registers, 
operators and multiplexers are still ‘virtual’ resources, because the physical 
mapping has not been done. The coarse grain controller is not the final, cycle-precise 
FSM, as in this phase zero delay assumptions are made for all operators. 

Physical Synthesis and Fine Scheduling 

The role of the fine grain scheduling and mapping step of Figure 2 is to generate the 
physical layout for a given cell library. This task is performed in the following three 
steps: 

Mapping: The data-path of generic operators is translated into a netlist of physical 
cells. During this step, a generic operator may be directly mapped to one physical 
operator in the target library, or to an equivalent netlist of standard cells. 

Characterization: The propagation delays and the setup/hold times for all operators 
in the data-path are computed from the physical netlist. During this step, 
delays induced by the physical connections are taken into account. Routing capacitances 
can be estimated or extracted from the layout after place and route. 

Scheduling: The basic blocks are extracted from the ‘coarse FSM’. Then, each basic 
block is scheduled for a given system clock frequency. The resulting fine-grain 
scheduling defines the final, cycle-precise FSM. 

The scheduling algorithm is ASAP taking into account the WAR (Write After 
Read) and WAW (Write After Write) precedence relations (Pangrle et al. 1987, 
Camposano 1991), and it supports operator chaining and multi-cycle operators. The 
fine grain scheduling algorithm is detailed in (Brunei 1996). The use of a character- 
ized data-path solves the first problem mentioned in the introduction. The two main 
outputs, physical data-path netlist and cycle-precise FSM controller, are VHDL de- 
scriptions, directly usable by the back-end tools. The third output is a cycle-precise, 
high speed simulation image for the PISIM simulator, which can be used for cycle- 
precise performance evaluation at system level. 
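
To make the idea of ASAP scheduling with operator chaining and multi-cycle operators more concrete, here is a simplified sketch. It is not the UGH scheduler: it ignores WAR/WAW precedences, register setup/hold times and resource conflicts, and it assumes that a multi-cycle operator starts on a clock edge. The function `asap_schedule` and its data layout are hypothetical.

```python
from graphlib import TopologicalSorter


def asap_schedule(ops, period):
    """ops maps name -> (delay, [predecessor names]); returns name -> start cycle.

    Delays and the clock period are integers in the same time unit.
    """
    graph = {name: preds for name, (_, preds) in ops.items()}
    finish, start_cycle = {}, {}
    for name in TopologicalSorter(graph).static_order():
        delay, preds = ops[name]
        start = max((finish[p] for p in preds), default=0)
        cycle = start // period
        if delay <= period:
            # Chaining: stay in the current cycle only if the result
            # settles before the end of that cycle.
            if start + delay > (cycle + 1) * period:
                start = (cycle + 1) * period
        elif start % period != 0:
            # Multi-cycle operator: assumed to start on a clock edge.
            start = (cycle + 1) * period
        finish[name] = start + delay
        start_cycle[name] = start // period
    return start_cycle


ops = {"a": (4, []), "b": (5, ["a"]), "c": (3, ["b"]), "d": (25, ["c"])}
print(asap_schedule(ops, period=10))
```

With a period of 10, operations a and b are chained in cycle 0, c is pushed to cycle 1, and the multi-cycle operation d starts in cycle 2.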



3 ARCHITECTURE REFINEMENT AND COARSE SCHEDULING 

The use of the ‘Draft Data-Path Scheme’ in the synthesis process is illustrated on 
a classical example: Figure 4.a shows the VHDL behavioral code for a slave 
coprocessor that computes the Highest Common Factor of two integers. 
In this example, the communication between the coprocessor and the PIBUS 
controller is based on two separate FIFOs. The two integers are read sequentially on 
two clock edges. The core of LOOP does the main calculation. The result is written 
to the output FIFO on a clock edge. Figure 3.a shows the DDPS corresponding 
to the example of the Highest Common Factor on Figure 3.b. Note that the two variables 
used as registers are clearly identified both in the DDPS and the VHDL behavioral 
description. 
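
For readers less familiar with VHDL, the computation performed by the LOOP can be sketched in Python. The variable names mirror RA, RB, NUL and INF of the behavioral description; the bus protocol is left out, and the check for positive inputs is our addition (the subtractive algorithm does not terminate otherwise).

```python
def hcf(ra: int, rb: int) -> int:
    """Highest Common Factor by repeated subtraction."""
    if ra <= 0 or rb <= 0:
        raise ValueError("operands must be positive")
    nul = ra == rb
    while not nul:
        inf = ra < rb
        if inf:
            rb = rb - ra   # the smaller value is subtracted from the larger
        else:
            ra = ra - rb
        nul = ra == rb
    return ra


print(hcf(12, 18))  # 6
```

Only one subtraction and one comparison are needed per iteration, which is why the DDPS can be limited to two registers, one subtracter and one comparator.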



3.1 Translating behavioral VHDL into Petri Nets 

In the first phase the VHDL is compiled into a formal model based on Interpreted and 
Timed Petri Nets (ITPN) (Encrenaz 1995, Bawa 1996). The Petri Net (Murata 1989) 
represents the control structure (CDFG) of the VHDL processes. An external data 

if (cond) then          if (cond) then          if (cond AND WIRED) then 
  R := A - B;             R := A - B;             R := A + B; 
else                    else                    else 
  R := B - A;             R := C + D;             R := B - A; 
endif;                  endif;                  endif; 

        (S1)                    (S2)                    (S3) 

Figure 5 Some conditional statements. 



Figure 6 Minimizing the Number of Multiplexers (panels (a) and (b)). 



part of the Petri Net contains the data modified by the firing of transitions in the 
control part. 

Each process is composed of places and transitions. Places refer to the states of 
the process and transitions are fired to pass from one state to the next, representing 
the VHDL statement executed between these two states. Transitions are split into 
two disjoint sets. Those modeling VHDL wait statements belong to the RES set; 
they are only firable during the RESUME phase of VHDL delta cycles. All other 
VHDL statements are represented by EXE transitions, which are firable during the 
EXECUTE phase of VHDL delta cycles. 

Interactions between the control part and the data part take place while transitions 
are fired. These interactions are represented by means of attributes associated to each 
transition t of the Petri Net: 



• g(t) is the guard of transition t: t may fire only if its guard is true. g(t) is a 
boolean function of data contained in the data part. 

• ASG(t) is the set of data modified while firing transition t. 

• TRF(t) is the set of transformations applied to the data in ASG(t). TRF(t) is a 
set of couples (d, trf_{d,t}) where d ∈ ASG(t) and trf_{d,t} is a function of data in the 
data part. 
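
The three attributes above can be pictured with a small data structure. This is only a sketch of one possible encoding: the class `Transition`, the dictionary-based data part, and the rule that all transformations read the data as it was before firing are assumptions of the example, not the ITPN implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Data = Dict[str, int]


@dataclass
class Transition:
    name: str
    guard: Callable[[Data], bool]                                   # g(t)
    trf: Dict[str, Callable[[Data], int]] = field(default_factory=dict)  # TRF(t)

    @property
    def asg(self) -> set:
        """ASG(t): the data written when the transition fires."""
        return set(self.trf)

    def fire(self, data: Data) -> Data:
        if not self.guard(data):
            raise RuntimeError(f"guard of {self.name} is false")
        # Every trf_{d,t} is evaluated on the old data, then committed at once.
        updates = {d: f(data) for d, f in self.trf.items()}
        return {**data, **updates}


t = Transition("rb_minus_ra",
               guard=lambda d: d["RA"] < d["RB"],
               trf={"RB": lambda d: d["RB"] - d["RA"]})
print(t.asg, t.fire({"RA": 4, "RB": 10}))
```

Here the transition models the statement RB := RB - RA guarded by RA < RB from the HCF example.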



This formal model of ITPN has been used with success for several applications, 
notably model checking (Bawa-b et al. 1996), behavioral equivalence (Bawa-a et 
al. 1996) and behavioral synthesis (Bawa-a et al. 1996). The ITPN generated from 
the VHDL code of HCF is shown on Figure 4.b. 

Figure 7 Allocation for hard-wired conditional statement (panels (a) and (b)). 



Binding_ITPN (ITPN, DDPS) 
{ 
    if (!Consistency_Check(ITPN, DDPS)) 
        abort();    // DDPS inconsistent with ITPN 

    Let ℛ   = set of the registers in DDPS 
        A   = set of assignations to elements of ℛ in ITPN 
        A_h = set of hard-wired assignations in A 
        A_m = set of micro-coded assignations in A 
    Such that A = A_h ∪ A_m with A_h ∩ A_m = ∅ 

    For each R ∈ ℛ : { 
        For each Assign ∈ A_h with R = target(Assign) : { 
            Balance_branches(ITPN, R, A_h) 
            Let A_R = set of parallel assignations to R 
            Let Op  = set of (sets of) operator(s) from DDPS performing A_R 
            OptR = Find_Optimal_Element(Op) 
            DDPS = Bind(R, OptR, Assign, DDPS)    // Assign to R will be done by OptR 
            // here we incrementally update the DDPS into a data-path 
        } 
        For each Assign ∈ A_m with R = target(Assign) : { 
            PathR = Get_Possible_Paths(R, A_m, DDPS) 
            OptPathR = Minimal_Cost(PathR, DDPS) 
            // minimal communication and most adequate operators 
            DDPS = Bind(R, OptPathR, Assign, DDPS) 
        } 
    } 
    Return (DDPS)    // it is now a data-path 
} 

Figure 8 Allocation Algorithm. 



3.2 Allocation and Scheduling 

Allocation starts with a consistency check (see algorithm on Figure 8): all the operations 
required by the behavioral description must correspond to a possible transfer 
in the DDPS. Also, enough registers must have been instantiated to store all the non-trivial 
variables. Classical compiler methods adapted to the formalism of ITPN and 
to the signal/variable semantic differences are used to strip off variables that are just 
place-holders and then check the availability of registers (Bawa 1996). 

To build the coarse FSM, along with the data path synthesis, the fields of the 
Micro-Instruction Register (MIR) are defined. There is one field for each controlled 
resource in the Data-Path. One state in the control FSM is generated (with the transition 
function) for each transition in the Petri Net. Let us see in more detail how the 
data-path synthesis is done for some simple, yet demonstrative, examples. 

The first step in the allocation is to determine which operators could be used to 
perform a given transformation in the data space. This is done by a backward exploration 
of the allowed paths in the DDPS, starting at the target register (R), through the 
possible operators (SUB), and ending at the input data (A,B). These possible paths 
are shown on the left-hand sides of figures in this section. Whenever an operator has 
to be shared between several transfers, we solve the conflict with multiplexers on its 
inputs. The operator-to-transfer binding is done in a basically greedy fashion with 
some extra rules. 

Firstly, we evenly balance transfers on operators, so as to minimize the amount of 
communication required (for instance, when binding behavior (S1) with the DDPS 
of Figure 6.a we choose to use both subtracters and multiplex their outputs, thus 
sparing 1 multiplexer). 

Secondly, when dealing with multiple-transfer operators, like an Adder-Subtracter 
in the example for (S2) on Figure 5, a merely greedy allocation of operators could 
first associate ASB to A-B, then force the reuse of ASB for C+D. The rule that avoids 
this suboptimal allocation is that we always try to allocate the most specific operator 
that matches the requirements of an operation (so SUB is associated to A-B instead 
of ASB). 
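
The "most specific operator" rule can be sketched as follows. We assume here that specificity is measured by the number of operations an operator supports (fewer means more specific), with ties broken by name; the function `most_specific` and the dictionary layout are hypothetical.

```python
def most_specific(operation: str, available: dict) -> str:
    """Pick the operator supporting `operation` with the fewest capabilities.

    `available` maps an operator name to the set of operations it can perform.
    """
    candidates = [name for name, ops in available.items() if operation in ops]
    if not candidates:
        raise LookupError(f"no operator in the DDPS performs {operation!r}")
    return min(candidates, key=lambda name: (len(available[name]), name))


ddps_ops = {"SUB": {"-"}, "ASB": {"+", "-"}}
print(most_specific("-", ddps_ops))  # SUB
print(most_specific("+", ddps_ops))  # ASB
```

For statement (S2) this keeps the ASB free for C+D, since A-B is bound to the dedicated SUB.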

The statement (S3) on Figure 5 illustrates our solution to the third problem mentioned 
in the introduction: the user guides the choice between firmware and hard-wiring 
for conditional instructions. In case of a WIRED request, the algorithm checks 
that assignations (from set A_h of Figure 8) on all execution paths of the branch are 
balanced (i.e. the same variables are assigned in all the branches) and that there are no 
data-dependencies inside the branches. The balance is necessary because all register 
WRITE-ENABLE signals must be controlled by the FSM. The independence is necessary 
in order to keep the scheduling manageable. Note that an improper balancing 
can be corrected automatically (by adding pseudo-assignations like Var := Var;) 
but dependencies can only be corrected by the designer (by rewriting the branch or 
alleviating the structural constraints). 
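
The two checks made for a WIRED request can be sketched as below. Each branch is represented as a dictionary from target variable to a textual expression, which is an assumption of the example; likewise, a "dependency" is taken to mean that an assignment reads a variable written by another assignment of the same branch. The helper names are hypothetical.

```python
import re


def balance_branches(branches):
    """Add pseudo-assignations (Var := Var) so all branches assign the same variables."""
    targets = set()
    for branch in branches:
        targets.update(branch)
    return [{**{t: t for t in sorted(targets)}, **branch} for branch in branches]


def has_internal_dependency(branch) -> bool:
    """True if an assignment reads a variable written elsewhere in the branch."""
    for target, expr in branch.items():
        read = set(re.findall(r"[A-Za-z_]\w*", expr))
        written_elsewhere = set(branch) - {target}
        if read & written_elsewhere:
            return True
    return False


then_branch = {"R": "A + B"}
else_branch = {"R": "B - A", "S": "C"}
print(balance_branches([then_branch, else_branch]))
print(has_internal_dependency({"R": "A + B", "S": "R + 1"}))  # True
```

Balancing is mechanical, whereas a detected dependency is only reported: as stated above, it is left to the designer to rewrite the branch.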

The DDPS given for the synthesis of (S3) is shown on Figure 7.a. Here the rule is 
to allocate the operator, if any, that matches all the operations targeted for the register 
R: we allocate the ASB and spare the SUB (Figure 7.b). This prioritizes code-op 
wiring over multiplexers, for best data-path performance. The resulting algorithm is 
summarized in Figure 8. The final Data-Path generated for the HCF example, starting 
from the DDPS exhibited on Figure 3.b, is shown on Figure 3.c. 

4 CONCLUSION 

The UGH project is the direct follow-on of the ALMA synthesis tool, which has been 
developed in the framework of a cooperation between the LEP laboratory (PHILIPS) 
and the MASI laboratory (UPMC) (Augé et al. 1995, Brunei 1996). A first software 
implementation is under development at UPMC. Several critical modules exist 
(including the VHDL to Petri Net compiler, and the fine grain scheduler), but 
the complete synthesis flow is not yet available. The first experiments are very 
promising, but there are possible improvements such as automatic loop unrolling or 
unfolding. 

UGH requires a DDPS input, which seems somewhat contradictory with the notion 
of high-level synthesis. Other HLS tools do not use such a feature. Nevertheless 
CALLAS (Biesenack et al. 1993) and CATHEDRAL (Lanneer et al. 1990) expect 
more information from the designer than just the pure behavioral description, to drive 
the synthesis process towards a specific target architecture. In HIS, the order of concurrent 
instructions is crucial. In the SYNOPSYS behavioral synthesis tool, the designer 
identifies the registers by precise and specific VHDL templates. The DDPS 
actually plays the same role but in an explicit form rather than through hidden added 
semantics in the behavioral description. UGH is conceptually an HLS tool, but it gives 
the designer much better control over the synthesis process. 

• The design process starts from a synchronous VHDL behavioral description, that 
supports fast, cycle based simulation at system level. The resulting, cycle-precise 
synthesized coprocessor can be simulated in the same system environment for 
reliable performance evaluation. 

• The designer has total control over the data-path structure, which is the condition 
for efficient optimization in practical cases. UGH does not limit the designer’s 
skill: one can obtain the same result as a classic custom design by gradually 
refining the DDPS towards an explicit data-path. 

• The coprocessor cycle time is not a result of the synthesis process. It is an external 
constraint defined by the designer. The timing behavior of the physical synthesized 
coprocessor is guaranteed by construction, which avoids design iterations. 

• Last but not least, DDPS provides a safeguard mechanism against the temptation 
to write non-synthesizable behavior. Forcing the designer to propose a consistent 
DDPS, makes him aware of the architectural complexity of the target coprocessor. 

Abstract 

As device geometries continue to get smaller, delay and fault specification for 
deep-submicron technologies is evolving into a quite complicated task. The 
cumulative effect of gate delays is now less important than the interconnect 
delays, particularly in chip level interconnections. Potential crosstalk effects must 
also be taken into account. In this paper we concentrate on the crosstalk effects 
between neighboring wires in deep submicron interconnection structures and their 
characterization for use in higher levels of the design hierarchy. Realistic models 
were used and HSPICE simulation results are presented. Empirical generalization 
between experimental data and estimated crosstalk potential is performed. 



Keywords 

crosstalk, deep submicron, estimation, interconnect templates 



This work was funded by CAPES/MEC under grant 0576/94 



VLSI: Integrated Systems on Silicon R. Reis & L. Claesen (Eds.) 
©IFIP 1997 Published by Chapman & Hall 





Empirical interconnect crosstalk characterization 



1 INTRODUCTION 

The expression “deep submicron” usually refers to technologies in which the 
smallest geometric features are sized below 0.6 µm, i.e., in these technologies the 
lithographical mask generation stage of the fabrication process has enough 
resolution to generate layout primitives whose minimum dimensions are below 0.6 
µm. 

It has been found that the physical characteristics of these technologies
considerably complicate the design process (Watts, 1989). Although the logic gate
delays have been substantially reduced because the transistor capacitances inside
the gate cells are smaller, the interconnection delays have increased relative
to the gate delays. This is because wires with a smaller cross section
have higher resistance. Even though the area capacitances tend to be
smaller in submicron technology, the closeness between wires tends to increase
inter-wire capacitances.

Furthermore, the crosstalk interaction between different wires in buses, for
example, or between wires in different layout layers, also influences the way
data is transferred within the chip. As device geometries have shrunk, so has
the minimum distance between adjacent conductors, increasing the interaction
among those conductors by means of stray capacitance, resistance and inductance.
Glitches are more likely to occur in these conditions and therefore fault analysis
strategies must be implemented. Timing optimization schemes that do not take
these effects into consideration can fail to generate a working chip at the first
design iteration, forcing the design team to reassess decisions made in the earlier
stages of design. The main consequence is, of course, a longer design cycle,
which delays time to market relative to competitors and results in
loss of revenue.



2 BACKGROUND 



Several techniques have been proposed in the literature for the analysis of VLSI
interconnects and coupled microstrip lines (Rubio, 1994), (Zhang, 1992), (Liaud,
1994), (Xie, 1993), (Achar, 1995), (Eo, 1995). The estimation of area and timing
has been addressed by various papers (Ramachandran, 1994), (Ramachandran,
1992), (Kaptanoglu, 1996), as has the linking between physical and higher
abstraction levels (Jha, 1994), (Kurdahi, 1993), (Mintz, 1994). A better
characterization of crosstalk, delay and transmission line behavior in interconnect
wiring is necessary to reduce the gap between estimated timing and the final
physical implementation (Rabaey, 1996). In the following subsection we present some
background information about interconnect behavior.







Part Eleven Architectural Design and Synthesis 



Interconnect Delay 

The increase in the number of gates and RTL components in a single chip indicates 
that the average interconnect length also increases. The wiring length L for an 
average net in a circuit with area A can be approximated by 

$$ L = f(A) \qquad (1) $$

Some conclusions can be drawn from the fact that L
increases with A. The resistance and capacitance of a line are both proportional to
wire length (Rabaey, 1996), which in turn increases with A. In addition, signals
have to travel longer distances because die sizes are getting larger, and the delays
and crosstalk associated with transmission line effects also tend to get larger
(Watts, 1989), (Sorkin, 1987).



3 DESIGN METHODOLOGIES IN DEEP SUBMICRON DESIGN 



Design methodologies are expected to change in view of the greater importance
and influence the physical level has on the logic and behavioral levels. The
synthesis of a chip based on early design exploration must now take into account
the topological and lossy characteristics of buses.

Traditional design synthesis 

Traditional design synthesis methodologies are those that do not primarily
account for the impact of interconnect on timing analysis and logic optimization. In
these tools the primary contributors to delay in any logical path in the design are
assumed to be the gates, not the wires or interconnects. This methodology consists
primarily of the stages described in Gajski (1994).

After high-level synthesis, a register transfer level (RTL) netlist is obtained, as well
as Boolean functions representing each of the RTL components. This netlist serves
as one input to the logic synthesis stage. During logic synthesis the input is
converted into a netlist of gates which are then placed in a floorplan and routed. 
Physical characteristics of the circuit are extracted from the layout and then the 
circuit is simulated again with enhanced physical and delay information. If the 
simulation indicates that the design is working reliably for a set of timing and area 
constraints then the layout can be sent to the foundry. If not, the designers must 
reevaluate the circuit at the logic level and modify some of the critical paths to 
meet the timing and/or area constraints. The design may not converge after several 
iterations in the cycle described above. In this case the weight of the physical delay 
information is of such influence in the back-annotated simulations that the design 
usually does not meet all the timing constraints. 
Deep submicron design considerations

The design process described above has a fundamental flaw: it does not take
detailed interconnect information into account when doing scheduling, allocation
and binding. The high-level synthesis process is, therefore, devoid of appropriate
data concerning bus data transfers and wiring delay in general. One way of
obtaining some physical information at an early level of design is to perform
floorplan estimation. The floorplan estimator should be able to feed back some
interconnect information to the high-level synthesis stage. In this case we would
end up with a set of tools like the one in Figure 1.

This way significant wiring delay can be considered in the front end and used as 
any of the other elements such as functional units, storage etc. Thus, we need a 
suitable high-level interconnect model characterized as a function of its delay, 
area, bit-width and crosstalk. The following items present some basic aspects of 
deep submicron interconnects. 




Figure 1 High-level synthesis system with enhanced interconnect information

Interconnect behavior

The delay of a bus wire can be considered to be dependent on its capacitance,
resistance and inductance. Each of these three parameters has an effect on the
interconnect behavior because it may induce noise, such as crosstalk, which
reduces the reliability and fault-tolerance of the circuit, and increase the
propagation delay.

The propagation delay is proportional to the wire capacitance as expressed by

$$ t_p = \frac{C_{int}}{2 V_{DD}} \left( \frac{1}{\beta_n} + \frac{1}{\beta_p} \right) \qquad (2) $$

$$ C_{int} = \varepsilon_{ox} L \left[ \frac{W}{t_{ox}} + 0.77 + 1.06 \left( \frac{W}{t_{ox}} \right)^{0.25} + 1.06 \left( \frac{H}{t_{ox}} \right)^{0.5} \right] \qquad (3) $$

where L is the wire length, W is the wire width, H is the wire thickness, $t_{ox}$ is the
oxide thickness, $\varepsilon_{ox}$ is the permittivity of the oxide, $V_{DD}$ is the supply
voltage and $\beta_n$ and $\beta_p$ are the betas for
the N and P transistors respectively. Expression (3) includes fringing
capacitances (Rabaey, 1996). The interconnect resistance can be expressed by:

$$ R = \frac{\rho L}{H W} \qquad (4) $$

where $\rho$, L, H and W are the conductor’s resistivity, length, height and width.
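The capacitance and resistance expressions above can be evaluated numerically. The following is a minimal sketch, assuming the fringing-capacitance form of expression (3) as given in (Rabaey, 1996); the function names, the SiO2 relative permittivity of 3.9, the aluminium resistivity and the wire dimensions are illustrative assumptions and do not come from the paper.

```python
# Assumed physical constants (textbook values, not taken from the paper).
EPS_0 = 8.854e-12        # vacuum permittivity, F/m
EPS_OX = 3.9 * EPS_0     # SiO2 permittivity, assuming relative permittivity 3.9
RHO_AL = 2.7e-8          # aluminium resistivity, ohm*m


def wire_capacitance(length, width, thickness, t_ox, eps_ox=EPS_OX):
    """Expression (3): wire capacitance including fringing terms (SI units)."""
    per_unit_length = eps_ox * (
        width / t_ox
        + 0.77
        + 1.06 * (width / t_ox) ** 0.25
        + 1.06 * (thickness / t_ox) ** 0.5
    )
    return length * per_unit_length


def wire_resistance(length, width, thickness, rho=RHO_AL):
    """Expression (4): R = rho * L / (H * W)."""
    return rho * length / (thickness * width)


# Hypothetical wire: 1000 um long, 1 um wide, 1 um thick, over 1 um of oxide.
L, W, H, T_OX = 1000e-6, 1e-6, 1e-6, 1e-6
c_wire = wire_capacitance(L, W, H, T_OX)
r_wire = wire_resistance(L, W, H)
print(f"C = {c_wire * 1e15:.1f} fF, R = {r_wire:.1f} ohm")
```

Both quantities scale linearly with the wire length, which is why longer average nets translate directly into larger RC delays.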

Interconnect wires exhibit inductive parasitics, as do bonding wires and chip
packages. A changing current through an inductor generates a voltage
drop. The magnitude of the voltage drop (ΔV) for an inductance L and a rate of change of
current (dI/dt) is:

$$ \Delta V = L \frac{dI}{dt} \qquad (5) $$



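The wave propagation equation in a lossy transmission line can be expressed by

$$ \frac{\partial^2 v}{\partial x^2} = r c \frac{\partial v}{\partial t} + l c \frac{\partial^2 v}{\partial t^2} \qquad (6) $$

where r, c and l are the resistance, capacitance and inductance per unit length,
as shown in (Rabaey, 1996).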

Crosstalk factor 

Crosstalk is caused by capacitive and inductive coupling between neighboring
interconnect lines. Another source of crosstalk among circuits is common
impedance in ground loops. The crosstalk between different lines in a bus must be
taken into account in order to preserve data integrity throughout the synthesis
process. It has been shown [1] that circuit wires coupled by parasitic capacitances
can reach a level of interference high enough to affect the signal reliability, i.e., the
voltage level variations in neighboring wires can affect each other to the point at
which the logic information in these wires is changed. Therefore, crosstalk must be
seen as a potential fault condition in submicron ICs.

Crosstalk coupling is defined as the ratio of the power induced in the disturbed line to the
power in the disturbing line. Consider two bus lines adjacent to each other,
Line1 and Line2. The amount of crosstalk between them is expressed in dB by:

Crosstalk(Line1, Line2) = 10 log (P2/P1) = 20 log (V2/V1) (7)

where V1 is the voltage in the disturbing wire and V2 is the voltage in the
disturbed wire as induced by capacitive, inductive and resistive coupling.
Based on expression (7) we can define a crosstalk factor $Crosstalk_{factor}$(Line1, Line2)
= f(ΔV2, ΔV1) in which the interference between line 1 and line 2 is
expressed through the magnitudes of their voltage swings:

$$ Crosstalk_{factor}(Line1, Line2) = 20 \log \frac{V2_{max} - V2_{min}}{V1_{max} - V1_{min}} \qquad (8) $$

where $V1_{max}$ is the maximum voltage for line 1, $V1_{min}$ is the minimum voltage for
line 1, $V2_{max}$ is the maximum induced voltage for line 2 and $V2_{min}$ is the minimum
induced voltage for line 2. The crosstalk factor, also expressed in dB, has
negative values, indicating the attenuation between wires.
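As an illustration of expression (8), the sketch below computes the crosstalk factor in dB from the voltage swings of the disturbing and disturbed lines. The function name and the sample voltages are assumptions chosen for illustration; they are not simulation results from this paper.

```python
import math


def crosstalk_factor_db(v1_max, v1_min, v2_max, v2_min):
    """Expression (8): 20*log10 of the disturbed swing over the disturbing swing."""
    swing_disturbing = v1_max - v1_min   # line 1, the aggressor
    swing_disturbed = v2_max - v2_min    # line 2, the victim
    if swing_disturbing <= 0 or swing_disturbed <= 0:
        raise ValueError("both voltage swings must be positive")
    return 20.0 * math.log10(swing_disturbed / swing_disturbing)


# Hypothetical case: a 3.3 V swing on line 1 induces a 0.33 V swing on line 2.
factor = crosstalk_factor_db(3.3, 0.0, 0.33, 0.0)
print(f"{factor:.1f} dB")
```

A smaller induced swing gives a more negative factor, so values closer to zero indicate worse crosstalk.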

In order to characterize the crosstalk potential for high level interconnect
descriptions we assume that the crosstalk is proportional to the thickness of the
conductor and the length of the interconnect wires, and inversely proportional to the
spacing between conductors. A simple empirical expression for the crosstalk
potential is

$$ Crosstalk_{potential} = K \cdot \frac{L}{L_{safe}} \cdot \frac{TH}{SP} \qquad (9) $$

where L is the wire length, $L_{safe}$ is the length considered to be ideal for crosstalk
avoidance, TH is the wire thickness, SP is the spacing between conductors, K is
a constant related to the technology being used and B is the linear coefficient for
the interpolated function obtained from experimental data. Using this simple
expression we can classify interconnect buses regarding their potential for
crosstalk interference.
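The classification described above can be sketched as follows. The function name, the default K = 1 and the candidate bus list are assumptions made for illustration only; the geometric values reuse thicknesses and spacings mentioned for the 0.25 µm experiments.

```python
def crosstalk_potential(length, l_safe, thickness, spacing, k=1.0):
    """Expression (9): K * (L / L_safe) * (TH / SP). K = 1 is an assumed default."""
    if l_safe <= 0 or spacing <= 0:
        raise ValueError("l_safe and spacing must be positive")
    return k * (length / l_safe) * (thickness / spacing)


# Hypothetical candidate buses: (name, L, L_safe, TH, SP), all lengths in um.
buses = [
    ("bus_a", 500, 500, 0.75, 1.5),
    ("bus_b", 500, 500, 1.5, 0.75),
    ("bus_c", 500, 500, 0.75, 0.75),
]

# Rank the buses from the highest to the lowest crosstalk potential.
ranked = sorted(
    ((name, crosstalk_potential(l, ls, th, sp)) for name, l, ls, th, sp in buses),
    key=lambda item: item[1],
    reverse=True,
)
for name, potential in ranked:
    print(name, potential)
```

Thick, closely spaced wires end up at the top of the ranking, which matches the assumption that the potential grows with TH and shrinks with SP.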

Our objective is to find a function $Crosstalk_{estimate} = f(Crosstalk_{potential})$ that
approximates the actual experimental data as calculated in (8):

$$ Crosstalk_{estimate} = f(Crosstalk_{potential}) \approx Crosstalk_{factor} \qquad (10) $$



4 EXPERIMENTAL RESULTS 

We address the crosstalk in two parallel lines using the following technologies: 0.6
µm and 0.25 µm. The circuit topology is shown in Figure 2. The circuit was
simulated using Hspice level 3 transmission lines and Hspice netlists extracted
from Compass Tools standard cell libraries. The cells were extracted from layout
using the Extract2 tool and then converted to Hspice netlists using the Netlist
utility. The transistor models used in Hspice were of level 13. In our first example,
both lines are active, driven by pulse voltage sources of varying frequencies. After
a certain time, the tri-state buffer is disabled so that the output of the buffer
connected to the input of the disturbed transmission line goes into a high
impedance state. The crosstalk is then observed between those lines.




Figure 2 Test Hspice circuit with subcircuit and model descriptions (inverters: 1X;
tri-state buffer: 1X; lossy transmission line model in sea of dielectric; MOS
transistors of level 13).

By modeling the source and destination impedances as detailed gates extracted from an
actual library, the simulation results are more realistic than those reported in other
references, where the drivers and loads are usually modeled as a
resistor-capacitor pair.

The simulations take into account three varying geometric parameters: wire length,
thickness and spacing. In the first batch of results we have three different line
lengths (500 µm, 1000 µm and 1500 µm) and two thicknesses (1 µm and 2 µm) for
a 0.6 µm process. The simulation results, shown in Table 1, clearly show that the
greater the thickness, the worse the crosstalk. In all cases we used expression (8).
For a 0.25 µm process the results are shown in Table 2. These results show that
smaller feature size technologies are indeed prone to hazardous interconnect
crosstalk, as previously stated.

Table 1 Crosstalk Factor (dB) for 0.6 µm technology

L (µm)    TH (µm)    SP (µm)    Crosstalk factor (dB)
500       1          1.8        -15.1
500       1          3.6        -17.9
500       2          1.8        -14.1
1000      1          1.8        -15.3
1000      1          3.6        -19.5
1000      2          1.8        -12.5
1500      1          1.8        -15.2
1500      1          3.6        -19.8
1500      2          1.8        -13.2

Table 2 Crosstalk Factor (dB) for 0.25 µm technology

L (µm)    TH (µm)    SP (µm)    Crosstalk factor (dB)
500       0.75       0.75       -8.6
500       0.75       1.5        -13.2
500       1.5        0.75       -6.8
1000      0.75       0.75
1000      0.75       1.5
1000      1.5        0.75       -6.2
1500      0.75       0.75       -7.4
1500      0.75       1.5        -11.4
1500      1.5        0.75       -5.82

Of particular importance is the behavior of the high impedance wire. If the tri-state
buffer’s output is switched into high impedance while the line itself is in transition
from one logic level to another, the average level of disturbance can fall exactly in
the threshold region of the destination gate (in our case an inverter), causing erratic
behavior in its output due to crosstalk, as shown in Figure 3. It should be noted that
one of the functions of the output gate would be to restore the signal arriving from
the interconnect to a strong and defined logic level. In the simulation results all
three interconnect lengths have similar crosstalk factors, but the only one whose
output is not significantly disturbed is the 500 µm wire. The other two show
severe voltage level disturbances, even after the destination gate output. We also
performed simulations with one of the wires in “quiet” mode, i.e., the
disturbing wire is switched while the disturbed wire is held at a constant voltage
level at its input, in our case 0 volts. In this case the voltage sources in Figure 2, V2
and V3, were set to DC operation at 0 and 3.3 volts, respectively. The simulations,
as expected, show the influence of both wire thickness and spacing.




Figure 3 Simulation results for 0.6 µm technology with SP = 1.8 µm, TH = 1 and 2
µm, L = 1500 µm. The top window has the waveforms for signals in the “active”
wire and the bottom window has waveforms for the “passive” wire. After 2.5 ns
the tri-state buffer is deactivated and the wire goes into high impedance. N02
and OUT2 are indicated in Figure 2.
Crosstalk rating

Since our objective is extracting enough information to be used in higher levels of 
design without jeopardizing the main advantages of higher levels of abstraction, 
i.e., faster design exploration, we use some of the data obtained during this 
simulation to create a sample of interconnection elements with additional crosstalk 
information. 

By organizing the interconnect crosstalk ratings as in Table 5 we can see the relation
between the $Crosstalk_{factor}$ and the $Crosstalk_{estimate}$. For example, in the case of a
0.25 µm technology, for a length L = 500 µm, $L_{safe}$ = 500 µm, TH = 0.75 µm and
SP = 1.5 µm, we would have the following expression for the crosstalk potential:

$$ Crosstalk_{potential} = \frac{L}{L_{safe}} \cdot \frac{TH}{SP} = \frac{TH}{SP} = 0.5 $$



As a first approximation we could use an expression of the form $y = k x^{c}$ to
characterize the crosstalk factor $Crosstalk_{factor}$ as a function of the crosstalk
potential $Crosstalk_{potential}$. Let x be the crosstalk potential, k a constant that depends
on the experimental data and y the crosstalk factor. From algebraic
manipulations we find that, for this particular technology, c = 0.208 and k = 14.9.

Therefore, we have

$$ Y = K X^{c} $$

If $Y = Crosstalk_{estimate}$ and $X = Crosstalk_{potential} = (L/L_{safe})(TH/SP)$, then

$$ Crosstalk_{estimate} = K \left( \frac{L}{L_{safe}} \cdot \frac{TH}{SP} \right)^{c} \qquad (11) $$

or

$$ Crosstalk_{estimate} = 14.9 \left( \frac{TH}{SP} \right)^{0.208} $$

since $L = L_{safe}$ in this case, with some of the points shown in Table 5 for TH = 0.75
and 1.5 µm, and SP = 0.75 and 1.5 µm. The above expression is a function of the
crosstalk potential and was extracted for this technology by interpolating the
experimental data.
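The "algebraic manipulations" behind a power-law fit can be sketched with two sample points: taking logarithms of y = k·x^c gives a straight line whose slope is c. The code below is only an illustration under stated assumptions. The two points (x = 1, y = 14.9) and (x = 0.5, y = 12.9) are hypothetical inputs chosen to be consistent with the reported constants, and the function names are not from the paper; the authors' actual interpolation procedure is not described in detail.

```python
import math


def fit_power_law(p1, p2):
    """Solve y = k * x**c exactly through two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    c = math.log(y2 / y1) / math.log(x2 / x1)   # slope in log-log space
    k = y1 / x1 ** c
    return k, c


def crosstalk_estimate(potential, k=14.9, c=0.208):
    """Expression (11) with the constants reported for the 0.25 um technology."""
    return k * potential ** c


# Hypothetical sample points (crosstalk potential, |crosstalk factor| in dB).
k_fit, c_fit = fit_power_law((1.0, 14.9), (0.5, 12.9))
print(round(k_fit, 1), round(c_fit, 3))

# Estimate for the worked example above: TH/SP = 0.75/1.5 = 0.5, L = L_safe.
estimate = crosstalk_estimate(0.5)
print(round(estimate, 2))
```

With more than two measurements, a least-squares line through the (log x, log y) pairs would be the natural generalization of the same idea.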

In Table 5 we show the relation between the estimated crosstalk ($Crosstalk_{estimate}$) as
defined in (11) and the experimental data, expressed by the magnitude of the
$Crosstalk_{factor}$ as defined in (8). By analyzing this table, we notice that the
estimated data correlates well with the HSPICE simulation results.

Table 5 Crosstalk estimate vs |Crosstalk factor| (measured data) (dB)

      L = 500 µm               L = 1000 µm              L = 1500 µm
Estimate    |Factor|     Estimate    |Factor|     Estimate    |Factor|
17.21078    20           23.65006    20.4         26.39016    20.8
14.9        14.9         18.3        18.3         20          19
13.69491    13.9         15.75061    16.2         17.00566    17.8
12.89947    12.9         14.16022    14.1         15.15717    16.2
12.31444    11.9         13.03807    13.45        13.8629     14.7
11.85619    10.9         12.18754    12.8         12.88788    13.3
11.16755    8.8          10.95693    11.5         11.48698    12
10.26433    4.9          9.430509    8.9          9.767187    9.8


This approach, which takes into consideration physical process information, can 
perform a good estimation of the crosstalk effects in the technologies used here. 

This is one stage in an ongoing research effort aimed at characterizing interconnect
behavior for high-level synthesis systems by means of interconnect templates. Some
of these interconnection modules would use this information in the following
manner:

Component = IT (delay, shape, crosstalk)

where “IT” stands for “Interconnect Template”, delay is as defined in expression
(2), shape is the size of the interconnect bus, which is a function of wire width,
spacing and the number of wires, and crosstalk is the rating defined in (11). ITs
are the subject of an upcoming paper. More detailed experimental data can be
found in the unabridged version of this paper at http://jblevins.ics.uci.edu/.
5 CONCLUSIONS

In this paper we presented an empirical model for high level classification of
interconnects. Our approach is to create an interconnect component library with
interconnect cells classified, or rated, according to their physical behavior
regarding crosstalk, delay and shape. These interconnect cells are to be used in
higher levels of the design cycle so as to improve its convergence and therefore
shorten its turnaround time. In order to give an example of this rating we built a
sample library of interconnect components based on 0.6 µm and 0.25 µm
technologies.
Abstract

This paper presents new methods for high- and system-level synthesis based on
transformation of a behavioral description to a special behavioral model,
development of new and modification of existing synthesis techniques for this
model, and development of a net-based synthesis model and techniques extending target
architectures. VHDL-based high-level synthesis tools running on an IBM PC
platform are described.
1 INTRODUCTION

Effective high-level synthesis systems include ALERT, AMICAL, CATHEDRAL,
CMUDA, DAA, ELLA, FACET, HAL, HIS, MAHA, MIMOLA, PSAL2,
Yorktown Silicon Compiler, and others (Camposano, 1989, Courtois, 1994,
Gajski, 1992, 1994, Goossens, 1989, Jerraya, 1993, Mcfarland, 1990, Mermet, 1993,
Vercest, 1990). They implement a synthesis methodology consisting of the following
tasks:

• compiling a behavioral description presented in a hardware description 
language to an intermediate format 

• generating the control (CFG) and data (DFG) flow graphs 

• scheduling the description operators and statements 

• allocation of the functional, storage, and interconnection units 






• binding the behavior constructions to the units 

• generating the data path (DP) and finite state machine (FSM). 

These systems usually accept the behavioral description as it was defined by a 
designer, therefore the description expresses rather a designer's point of view than 
target architecture requirements. The synthesis tasks to be solved are very complex 
combinatorial problems in this case, and it is very difficult or impossible to find 
the optimal design. Natural extensions of high-level synthesis are system level 
synthesis and low power asynchronous circuit synthesis (Asynchronous, 1994, 
Mermet,1997). 

This paper presents new methods and tools for high- and system-level synthesis 
that explore behavioral description transformation, synthesis techniques based on 
orthogonality analysis, and net-based synthesis techniques. Section 2 introduces 
the concepts underlying the new synthesis methodology. Section 3 describes 
behavioral description transformation rules that explore a special behavioral 
model. Scheduling methods are presented in section 4. Section 5 presents 
allocation and binding methods for the model. Section 6 describes the key concepts 
of net-based synthesis methodology. The results for the AHILES high-level 
synthesis system appear in section 7. 



2 SYNTHESIS METHODOLOGY 

The methodology is based on the following three main principles: 

• transforming the source behavioral model to a special model, allowing 
efficient synthesis of high quality RTL-structures 

• development and use of new analyzing, scheduling, allocation, binding, data 
path generation, and finite state machine generation techniques that explore 
the special model advantages 

• extending traditional high-level synthesis methodology in order to automatically
design and optimize asynchronous circuits and systems.

Behavioral description transformations have been used locally in several high-level
synthesis systems (Camposano, 1989, Gajski, 1992, 1994, Jerraya, 1993,
Mcfarland, 1990, Mermet, 1993). This methodology employs transformations to
obtain the previously defined special behavioral model, which speeds up the design
process and allows generating faster and cheaper designs for the same constraints
on design parameters. Various HDL constructs introduced to represent
behavior have been investigated to find a representation that increases design space
exploration freedom and allows development of efficient scheduling, allocation,
and binding techniques. It has been found that the main restrictions on the
order of computations are implied by the CFG. The main idea of the special model is
to replace the order implied by the CFG with a weaker order implied by the DFG. This
can be achieved by splitting control structures into many separate parts connected
through data dependencies. The developed transformation rules increase freedom
for reordering statements and extend the design space.

Synthesis from the special behavioral model The special behavioral model 
decreases the influence of control structures on the order of calculations. The order 
is defined through using orthogonal relations introduced for Boolean and bit 
signals and variables, if-then statements, and operators. The orthogonal statements 
can share the same functional unit and execute in one FSM state concurrently. Due 
to the orthogonality relations, the number of statements executed in one HLFSM 
state increases and the number of states decreases. Techniques which perform 
variable lifetime analysis and calculate the statements precedence and 
compatibility relations are modified to account for the special model and 
orthogonality relations. Due to introducing the probability for a variable to take 
value true (1), the overall execution time under constraints on resources is 
estimated and minimized. Scheduling, allocation, binding, and DP and FSM 
generation methods are modified and extended in order to 

• explore orthogonality between statements 

• minimize the execution time mathematical expectation. 

Net-based high- and system-level synthesis. Net-based synthesis methodology
extends the set of target architectures. It constitutes a theoretical and practical basis
for the design of asynchronous circuits and systems. The key concept of this
methodology is a net schedule whose concurrency level, execution time, and cost
are defined through the set of concurrent statement pairs. Net-based scheduling,
allocation, and binding methods and techniques explore the concurrency space and
generate net schedules for given constraints on design parameters. The net
schedule can be used for synthesis of circuits and systems directly or can be a
source for generating sequential schedules.



3 BEHAVIORAL MODEL TRANSFORMATION 

Source behavioral model The behavioral specification is described in VHDL 
(IEEE, 1988). Modelling the behavior is in many aspects the same as in the 
traditional high-level synthesis systems (Mermet,1993). The behavioral description 
is presented by object declarations, process, signal and variable assignment, if, 
case, loop, exit, next, and wait statements. The design specification may be 
composed of several VHDL units and libraries. 

Special behavioral model The special behavioral model is described using
a subset of VHDL statements. The wait, signal assignment, variable assignment,
loop, exit, and next statements may execute unconditionally and conditionally. The
loop statement has no iteration scheme. The CFG of the behavioral description
constructed of these statements has only one segmented path. Each segment may
be processed separately. This allows the development of efficient lifetime analysis,
scheduling, allocation, and binding techniques. The behavioral model for GCD
(Mermet, 1993) is shown in Figure 1.

Transition probabilities. Probability p(v) is introduced for each conditional
variable or signal v. This is the probability that object v takes the value true
(1). The probability that object v takes the value false (0) is equal to 1-p(v).

    entity GCD is
      port(CLOCK, RESET, START: in BIT;
           XI, YI: in BIT_VECTOR(15 downto 0);
           READY: out BIT;
           RES: out BIT_VECTOR(15 downto 0));
    end GCD;

    architecture BEHAVIOR of GCD is
      attribute CYCLE_TIME of BEHAVIOR: architecture is 60 NS;
      attribute TOTAL_AREA of BEHAVIOR: architecture is 500 GT;
      attribute FUNCTION_UNITS of BEHAVIOR: architecture is "ALU16-1";
      attribute EXECUTION_TIME of BEHAVIOR: architecture is 2 US;
      attribute CRITERION of BEHAVIOR: architecture is "minT";
    begin
      P1: process
        variable X, Y: BIT_VECTOR(15 downto 0);
        variable V1, V2, V3, V4: BOOLEAN;
        attribute PROBABILITY of V1: variable is 1;
        attribute PROBABILITY of V2: variable is 0.05;
        attribute PROBABILITY of V3, V4: variable is 0.475;
      begin                                        -- Statement  Segment
        loop                                       --  1         1
          wait until CLOCK'Event and CLOCK='1';
          V1 := not START='1';                     --  2         2
          if V1 then READY <= '0'; end if;         --  3         2
          if V1 then X := XI; end if;              --  4         2
          if V1 then Y := YI; end if;              --  5         2
          exit when V1;                            --  6         2
        end loop;
        loop                                       --  7         1
          wait until CLOCK'Event and CLOCK='1';
          V2 := X=Y;                               --  8         3
          V3 := X<Y;                               --  9         3
          V4 := X>Y;                               -- 10         3
          if V3 then Y := Y-X; end if;             -- 11         3
          if V4 then X := X-Y; end if;             -- 12         3
          if V2 then READY <= '1'; end if;         -- 13         3
          if V2 then RES <= X; end if;             -- 14         3
          exit when V2;                            -- 15         3
        end loop;
      end process;
    end BEHAVIOR;

Figure 1 GCD special behavioral model

If condition v1 or ... or vn = 1 is true for objects v1,...,vn then p(v1)+...+p(vn)=1.
The transition probabilities are defined by attributes, as shown in Figure 1.

Optimization problem. Two formulations of the optimization problem can be
specified. One formulation minimizes the mathematical expectation of the design
execution time with constraints on the design cost. Another formulation minimizes the
design cost with a constraint on the execution time. Additional constraints are the
bounding clock cycle time, number of functional units, and others. The
optimization problem is described in VHDL by attribute declarations and
specifications (Figure 1).

Functional unit description. The set of VHDL operators is partitioned into subsets 
of compatible operators or operators that may be introduced into FSM. Including 
an operator into a subset depends on the operands width. For each operator and 
functional unit the delay, area, and number of pipeline stages are defined. All the 
values are represented by formulas. 

Transformation rules. A source VHDL behavioral description is equivalently
transformed to the special model by applying transformation rules. These modify
the behavior CDFG to speed up the design process and to improve the design
parameters. The rules transform a loop statement with an iteration scheme to a
loop without the scheme, reorder independent and dependent statements, insert a
statement into if- and loop-statements, extract computations from an if-statement, split
an if-statement into separate parts, transform an if-statement to a logical expression and
variable assignment statement, merge exit-statements, and unroll a loop-statement
without an iteration scheme.



4 SCHEDULING FOR THE SPECIAL BEHAVIORAL MODEL 

Background. Efficient scheduling techniques include as soon as possible (ASAP),
as late as possible (ALAP), list scheduling, freedom-based scheduling, force-directed
scheduling, integer linear programming formulation (ILPF), dynamic loop
scheduling, path-based scheduling, scheduling for pipelined architectures, and
others (Camposano, 1989, Gajski, 1992, 1994, Goossens, 1989, Hwang, 1991,
Jerraya, 1993, Mcfarland, 1990, Mermet, 1993). They accept the behavioral
description in a general form.

Single-path-based scheduling. The novel scheduling method performs an equivalent
transformation of the source behavioral description. The transformation results in
the special behavioral model, whose CFG has only one segmented path. The
extended scheduling techniques use relations constructed on the sets of signals,
variables, operators, and statements.

Segment tree. The segment tree is a hierarchical structure of the special behavioral 
model CFG. The tree root is the process statement. The other non-terminal nodes 
are loop statements. The terminal nodes are sequential statements of the model. 
The loop or process body statements constitute a segment. The FSM states are 
introduced during top down traversal of the segment tree. 
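The segment tree can be pictured with a small data structure. The sketch below is an assumption-laden illustration, not the AHILES implementation: the `Node` class, the `segments` function and the statement labels are hypothetical, and the example tree mirrors the GCD model of Figure 1 (process root, two loop nodes, sequential statements as leaves).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)   # empty for terminal nodes


def segments(root):
    """Top-down traversal returning one (owner, body statements) pair per segment."""
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:                           # process or loop node
            result.append((node.label, [c.label for c in node.children]))
            stack.extend(reversed(node.children))   # keep source order
    return result


# Hypothetical tree for the GCD model: the process body holds two loops.
loop_a = Node("loop 1", [Node("wait"), Node("V1 := ..."), Node("exit when V1")])
loop_b = Node("loop 7", [Node("wait"), Node("V2 := X=Y"), Node("exit when V2")])
process = Node("P1", [loop_a, loop_b])

for number, (owner, body) in enumerate(segments(process), start=1):
    print(f"segment {number}: body of {owner} -> {body}")
```

In this sketch the process body becomes segment 1 and the two loop bodies become segments 2 and 3, matching the segment column of Figure 1.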

Orthogonality and implication analysis for signals and variables. The rows and columns
of the orthogonality, implication, and independence matrix R correspond to
Boolean and bit signals and variables. Matrix elements r_ij belong to the set {⊥, →, ←, ↔, -},
where ⊥ defines objects i and j to be orthogonal, → defines object i to imply
object j, ← defines object j to imply object i, ↔ defines objects i and j to be
equivalent, and - defines objects i and j to be independent. The matrix is generated
during analysis of relational operators and by inferring new relations when rules are applied
to the operators and, or, xor, nand, nor, and not.

Orthogonality of statements. Two conditional statements if c1 then P1; end if; and
if c2 then P2; end if; are defined to be orthogonal iff c1 and c2 are orthogonal. The
bodies of orthogonal statements are mutually exclusive and may execute on the same
functional unit.

Operators compatibility and proximity. There are two cases for operators to be 
compatible within one high-level finite state machine (HLFSM) state: 

• the operators belong to orthogonal statements 

• the operators are relational and have the same inputs. 

The proximity accounts for the statement common inputs and outputs and is used 
to select compatible operators to be merged. Maximizing the operators proximity 
implies minimizing the number of interconnection units. 

Statements precedence relation. The statements precedence relation PRE is the union
VAL ∪ USE ∪ WAT of three subrelations. Relation VAL defines statement i to
precede statement j if i and j are not orthogonal and i has an output value which is
an input for j. Relation USE expresses that a variable may not be assigned a new
value while the old value is still used. Relation WAT defines all the non-orthogonal
statements located before a wait statement to precede the wait, and the wait
statement to precede all the non-orthogonal statements located after it.

Scheduling for the special behavioral model. Scheduling techniques ASAP, ALAP,
list scheduling, and others are extended for single-path-based scheduling. Given a
special behavioral model, transition probabilities, and optimization task, the goal is
to introduce states, distribute the statements onto the states, and generate an
appropriate schedule. To minimize the total execution time is to minimize the
number of the HLFSM states and the execution probability for each state.

HLFSM state execution probability. The state can include several if-then
statements. If conditional objects c0,...,cn are used in the state then the execution
probability P(state) is estimated from the probabilities p(c0),...,p(cn) and the
relations between the objects. For the orthogonality relation
P(state)=p(c0)+...+p(cn). For the independence relation
P(state)=1-(1-p(c0))*...*(1-p(cn)). For the implication relation c0 → cn the
probability is P(state)=p(cn). In general, the state probability is defined by a
composition of these formulas.
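
The three formulas can be illustrated with a minimal Python sketch. The function names and the sample probabilities are assumptions introduced here for illustration; they are not part of the AHILES system.

```python
from math import prod

def p_orthogonal(probs):
    """Orthogonal (mutually exclusive) conditions: probabilities add."""
    return sum(probs)

def p_independent(probs):
    """Independent conditions: 1 minus the probability that none holds."""
    return 1.0 - prod(1.0 - p for p in probs)

def p_implication(p_consequent):
    """c0 -> cn: whenever c0 holds cn holds too, so the state runs with p(cn)."""
    return p_consequent

# Hypothetical condition probabilities for one HLFSM state.
print(p_orthogonal([0.2, 0.3]))
print(p_independent([0.2, 0.3]))
print(p_implication(0.5))
```

A state mixing several relations would combine these helpers, for example by first collapsing an implication chain and then applying the independence formula to the remaining conditions.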

Estimating the execution time. The mathematical expectation of the overall
execution time is T = Cycle_Time * M, where M is the total number of states needed
to execute the model. The number of states needed to execute segment i is computed
from the number of states in the segment, the number of states before the exit, and
the probability of exiting from the segment.
5 ALLOCATION AND BINDING FOR THE SPECIAL
BEHAVIORAL MODEL

Background. Effective allocation techniques include rule-based expert systems, 
greedy iterating, branch and bound, clique-partitioning, linear-programming, 
simulated-annealing-based, path-based, force-directed, graph coloring, and 
interactive allocation algorithms (Camposano,1989, Gajski, 1992, 1994, Hwang, 
1991, Jerraya,1993, Mcfarland,1990, Mermet,1993). 

Single-path-based allocation and binding. Allocation and binding methods
developed for the special behavioral model exploit the advantages of the model. The
single-path-based allocation and binding flow is as follows. First, the special
behavioral model DFG is generated. Using the CFG single path, the variable lifetimes
are computed and the variables compatibility is determined. The compatibility
analysis takes into account the orthogonality analysis and the scheduling results.
Using the variables, operators, and statements compatibility, the functional,
storage, and interconnection units are allocated. Each variable, operator, and
statement is bound to a unit in such a way as to minimize the DP and FSM cost.

DFG weighted with relations. The special behavioral model DFG is defined as
DFG=(N,A) where N = V ∪ Pt ∪ Lt ∪ R is the set of nodes, A ⊆ N × (V ∪ Pt ∪ R) is
the set of edges, V is the set of variables, Pt is the set of ports, Lt is the set of
literals, and R is the set of statements. The variables compatibility relation Cv and
the statements compatibility relation Cr are appended to the DFG and used to fold
the DP.

Reenumeration of the HLFSM states. The HLFSM states are introduced and 
enumerated during the top down traversal of the segment tree and scheduling the 
segments. To define the variable lifetime by a state interval, the states must be 
reenumerated in the order for which the tree leaves are looked through from the 
left to the right. The new enumeration corresponds to the top down traversal of the 
states, but not the segments. 

Lifetime analysis. Function Inc(v,s) defines the mode of use of variable v in state s:
Inc: V × S → {∅, {in}, {out}, {in, out}} where V is the set of variables, S is the set
of states, and in and out are the modes of using a variable. The lifetime of variable
v is defined by an interval of states whose bounding states are determined using
the segment tree structure and the orthogonality relations.

Compatibility of variables. Variables used within one state are implemented as
wires. Variables used in several states are allocated to a latch, register, RAM, or
ROM, and may be merged if they are compatible. Variables v1 and v2 are compatible
if their lifetime intervals do not overlap or each statement that contains v1 is
orthogonal to each statement that contains v2. The variables compatibility is
described by a matrix.
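
The compatibility rule for variables can be sketched in Python as follows. This is an illustrative sketch only: lifetimes are assumed to be closed intervals of reenumerated state numbers, and the variable names, statement names, and orthogonality set are hypothetical.

```python
def lifetimes_overlap(a, b):
    """Closed state intervals (first, last) overlap unless one ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]

def compatible(v1, v2, lifetime, stmts, orthogonal):
    """Variables may share storage if their lifetimes are disjoint, or if every
    statement using v1 is orthogonal to every statement using v2."""
    if not lifetimes_overlap(lifetime[v1], lifetime[v2]):
        return True
    return all(frozenset((s1, s2)) in orthogonal
               for s1 in stmts[v1] for s2 in stmts[v2])

# Hypothetical data: lifetimes as (first state, last state) per variable.
lifetime = {"a": (1, 2), "b": (3, 4), "c": (2, 3)}
stmts = {"a": {"s1"}, "b": {"s2"}, "c": {"s3"}}
orthogonal = {frozenset(("s1", "s3"))}  # s1 and s3 are mutually exclusive

print(compatible("a", "b", lifetime, stmts, orthogonal))  # disjoint lifetimes
print(compatible("a", "c", lifetime, stmts, orthogonal))  # overlap, but orthogonal
print(compatible("b", "c", lifetime, stmts, orthogonal))  # overlap, not orthogonal
```

Evaluating `compatible` for every pair of variables fills the compatibility matrix that the merging step consults.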

Compatibility of statements. Compatibility of statements implies compatibility of
operators, which is derived from the operators' orthogonality, their execution in the
same or different states, and the operations' ability to share resources. The
operators compatibility relation is
$C_r = (Sha \cap \neg(Sta \setminus Ort)) \cup Sin$ where Sha is the
relation built of the pairs of operators which allow sharing resources, Sta is the
relation built of the pairs of operators executed in the same state, Ort is the
operators orthogonality relation, and Sin is the relation built of the pairs of
relational operators that execute in the same state and have the same input values.

Reordering operator inputs. The goal is to decrease the design cost by increasing 
the number of data dependences that can be allocated on the same interconnection 
unit. 

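The data path optimization problem is formulated as

$$\min_{dp \in DP} \left( a_{fu}\, c^{fu}_{dp} + a_{st}\, c^{st}_{dp} + a_{ic}\, c^{ic}_{dp} \right)$$

where DP is the set of feasible data paths; $c^{fu}_{dp}$, $c^{st}_{dp}$, and
$c^{ic}_{dp}$ are the functional, storage, and interconnection unit costs; and
$a_{fu}$, $a_{st}$, and $a_{ic}$ are factors.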

Folding techniques. Depending on the design space exploration approach, the DP 
optimization techniques are partitioned into global and local (Mcfarland, 1990). 
Both of them can fold the DP graph by merging variables, operators, and data 
dependences. The global techniques minimize the target function by searching for 
variables and operators to be merged, and by reordering operation inputs. The local 
techniques search for pairs of variables and operators to be preferably merged step 
by step. 



6 NET-BASED SYNTHESIS 

Background. The Petri net is a concurrency model widely used for representing 
asynchronous behavior of processes (Petri, 1962). 

Principles. There are two main problems in synthesis of circuits and systems 
composed of variable execution time components, how to 

• perform the scheduling, allocation, and binding tasks to optimize the design 

• build the components, synthesize the control, and construct a system. 

Both the problems can be solved within high-level synthesis net-based method- 
ology (Prihozhy,1996). 

Net schedule. The key concept of net-based synthesis is a net schedule that
describes mixed sequential/concurrent execution of the statements. The noncyclic net
schedule is a directed graph $G_n=(N,H)$ where N={1,...,n} is a set of statement
numbers and H is the direct precedence relation on statements. The relation defines
the direct predecessors of each statement; a statement executes when execution of
all its predecessors is complete. Direct data dependences between the statements
constitute the precedence relation defining the maximum concurrency net
schedule. Two statements are sequential if a path of graph $G_n$ between the
statements exists, otherwise the statements are concurrent. Statements are mutually
exclusive if they are orthogonal or sequential. Mutually exclusive statements can
never execute simultaneously. If mutually exclusive statements may share the
same resources they are compatible. Matrix Q describes the data dependences
between the statements: element $q_{ij}$ equals 1 if j uses a value delivered by i,
and equals 0 otherwise. Element $w_{ij}$ of matrix W equals 0 if statements i and j
may not execute on the same functional unit, equals 1 if the statements may share
resources, and equals 2 if the statements are orthogonal. Zero elements of matrix Q
define the maximum set of concurrent statement pairs. Setting D equal to this
maximum set defines the net schedule of maximum concurrency. The schedule
execution time is defined by the cliques of the graph (N, ¬D) and the schedule cost
is defined by the cliques of the graph (N, D).
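
The sequential/concurrent distinction can be sketched in Python with a plain reachability search. This is a minimal illustration under stated assumptions: the four-statement precedence relation is hypothetical, and orthogonality and resource sharing are left out.

```python
from itertools import combinations

def reachable(succ, start):
    """All statements reachable from start by following precedence edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify_pairs(nodes, edges):
    """Split statement pairs into sequential (a path exists) and concurrent."""
    succ = {}
    for i, j in edges:
        succ.setdefault(i, []).append(j)
    reach = {n: reachable(succ, n) for n in nodes}
    sequential, concurrent = set(), set()
    for i, j in combinations(sorted(nodes), 2):
        if j in reach[i] or i in reach[j]:
            sequential.add((i, j))
        else:
            concurrent.add((i, j))
    return sequential, concurrent

# Hypothetical precedence relation H: 1 precedes 2 and 3, which both precede 4.
seq, con = classify_pairs({1, 2, 3, 4}, [(1, 2), (1, 3), (2, 4), (3, 4)])
print(sorted(seq))
print(sorted(con))
```

In this example only statements 2 and 3 are concurrent, so they are the only pair that cannot be treated as mutually exclusive by sequencing alone.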

Net scheduling and allocation. For any subset D of the maximum set of concurrent
pairs we search for a net schedule of less concurrency. Two optimization tasks are
possible:

• minimize the net schedule execution time under a constraint on the cost

• minimize the net schedule cost under a constraint on the execution time.

One method solves the first task by consecutively adding pairs to set D. Another
method solves the second task by consecutively removing pairs from set D, starting
with the maximum set. To select a pair to be added or removed, the clique sets of
both graphs are analyzed. The pairs that decrease the execution time and do not
increase the cost are the most preferable. When a pair is added to or removed from
set D the clique sets are recalculated. Adding pairs to D is complete if the extended
set implies cost S greater than the bounding cost. Removing pairs from set D is
complete if the reduced set implies time T greater than the bounding time. We can
use net scheduling for Petri nets as well.

Existence problem. The problem is formulated as to find out whether any net
schedule is possible for a given set D or not. In (Prihozhy, 1996) the problem is
reduced to solving a combined logical equation. The method proposed to solve the
equation labels a graph in such a way as to avoid conflicts.

Figure 2 AHILES design flow

Net-based scheduling. The net schedule can be a source for synthesizing
sequential schedules. An ordinary sequential schedule is generated by ASAP and ALAP
techniques if the statement execution time equals the clock cycle time. A
sequential schedule with chaining is generated by the list scheduling and other
techniques for constraints on the cost or on the number of functional units. A
sequential schedule with multicycling is generated if the functional unit execution
time is greater than the clock cycle time. If functional units are functionally
pipelined, statements have to be split into parts, one for each stage of the pipeline.

Asynchronous circuit and system synthesis. Synthesis of asynchronous circuits and
systems is based on net scheduling algorithms, net allocation algorithms for
functional, storage, and interconnection units, net binding algorithms, methods of
constructing asynchronous circuit and system components, and methods of synthesis
of asynchronous circuits and systems composed of these components.



7 RESULTS 



The described models, methods, techniques, and algorithms are implemented within
the AHILES high-level synthesis system (Figure 2) (Prihozhy, 1996). Results
generated for five benchmarks (Courtois, 1994, Mermet, 1993) are presented in
Tables 1 and 2. All the RTL structures are synthesized on a PC 486/50. Table 1
presents CPU time for synthesis steps. The VHDL compiler throughput is 100 to
280 lines per second. The overall synthesis time is 5 to 14 sec. Generated
RTL-structure parameters appear in Table 2. The internal form size is 1.4 times
greater than the VHDL text size for behavioral descriptions and 0.86 times the
VHDL text size for structural descriptions. AHILES introduced few FSM states for
all the designs. This is due to the preliminary transformation of the behavioral
descriptions, the special behavioral model, and the novel scheduling, allocation,
and binding techniques. The average execution time of generated net schedules is
18% less than the execution time of optimal sequential schedules.

Table 1 Synthesis time (sec), PC 486/50

Synthesis steps          Bubble   Gcd     Gcdf    Kalman   Pid
Compilation              0.71     0.49    0.77    1.48     1.27
Linking                  0.44     0.44    0.50    0.49     0.44
Diagnostics              0.66     0.49    0.55    0.77     0.72
Analyzing                0.87     0.66    0.66    4.57     1.09
Scheduling               0.55     0.44    0.44    0.77     0.61
Allocation and binding   0.88     0.55    0.55    1.37     1.04
Data path generation     1.10     1.05    0.99    1.65     1.38
FSM generation           0.60     0.55    0.49    0.83     0.66



Table 2 Design parameters

Parameter                         Bubble   Gcd     Gcdf    Kalman   Pid
Behavior VHDL text (lines)        119      50      60      220      180
Behavior VHDL text (bytes)        3009     2089    2844    7966     9978
Behavior internal form (bytes)    10148    7160    7512    19478    13680
Statements                        79       19      29      176      122
Objects                           46       15      19      122      75
CFG and DFG (bytes)               5171     1409    1987    12340    8152
FSM states                        20       2       5       16       23
FSM transitions                   31       4       9       29       33
ALUs                              0        1       1       1        1
Functional units width (bits)     0        16      32      17       32
Registers                         7        2       2       18       13
Register width (bits)             104      32      64      138      389
RAMs                              1        0       0       3        0
ROMs                              0        0       0       3        1
Collectors                        0        0       0       5        9
Multiplexers                      4        4       4       14       8
Multiplexer width (bits)          68       64      128     155      227
Multiplexer inputs                13       8       8       36       33
DP internal form (bytes)          6177     3031    3170    17075    12425
FSM internal form (bytes)         4522     1132    1532    9927     8197
Structure internal form (bytes)   10699    4163    4702    27002    20622
Structure VHDL text (lines)       416      164     184     1000     724
Structure VHDL text (bytes)       12383    4647    5241    31550    22938

Abstract

This paper presents a sensitivity-based test generation tool for analog mul- 
tifrequency testing. This tool generates minimal test sets that maximize the 
coverage of soft, large and hard component faults and enhance the coverage of 
interconnect shorts. The test generation procedure is illustrated for a low-pass 
biquad filter. This procedure is now being automated by integrating commer- 
cially available tools for symbolic computation and electrical simulation. 



Keywords 

Computer-Aided Testing, Automatic Test Generation, Analog and Mixed- 
signal Circuits 

1 INTRODUCTION 

Mixed-signal integrated circuits have been widely used in automotive and
medical applications, in nuclear and space systems, in industrial automation,
etc. Since these applications require some interaction between the analog and
the digital world, several technologies have emerged that allow the
implementation of both circuitry types on the same substrate. In the same
proportion, the development of testing technologies has become a need, especially
for the analog parts included in these devices, since practical analog testing
solutions are lagging well behind their digital counterparts. As a consequence,
techniques that can help in the automation of analog testing problems are
nowadays of much interest. Within this context, this work addresses the problem
of automatically generating multifrequency tests for analog circuits.

A few works on automatic analog test generation for AC-testing have
appeared in the last years. (Nagi et al. 1993) uses a heuristic based on sensitivity
calculations to choose the circuit frequencies to consider. The algorithm does
not guarantee an optimal set of test stimuli. (Slamani et al. 1995) selects test
frequencies based on a multifrequency analysis of faults in analog circuits.
This analysis is based on the sensitivity computation of every measured
parameter with respect to soft, large and hard deviations in the nominal values
of passive components. The set of test frequencies generated is rather large,
since it maximizes the sensitivity of the measured parameters for each faulty
component. (Mir et al. 1996) proposes a multifrequency test generation and
fault diagnosis procedure and applies it to analog circuits that embed checkers
with an observable go/no-go digital output. The procedure finds a minimal
set of test measures and a minimal set of test frequencies which guarantee
maximum fault coverage and maximal diagnosis.

VLSI: Integrated Systems on Silicon R. Reis & L. Claesen (Eds.)

In this paper a new automatic test generation procedure is proposed which
enlarges the set of analog faults considered and merges the main features of the
previous works: the sensitivity analysis of soft faults, large deviations and hard
faults (Slamani et al. 1995), the search for minimal sets of test measures and
test frequencies (Mir et al. 1996), and the generation of tests for interaction
shorts based on fault simulation (Nagi et al. 1993). Based on these three axes,
an automatic test generation tool is under development that makes use of
commercially available software (SSpice, Maple and HSpice) in association
with some in-house tools.



2 PRELIMINARIES 

In order to detect faults in analog circuits, it is necessary to define a valid 
range for the output parameters. This range must be defined by the designer 
so that any output value within it is considered correct. These values are 
related to each design and to the accuracy of the components and the test 
equipment in use. Test parameters can be associated to the primary outputs 
or to some internal nodes of the circuit. We consider herein the former. 



2.1 Fault modeling 

The fault model used includes soft and large component deviations and, hard 
faults (Slamani et al 1995). Shorts between pairs of nodes (interaction faults) 
were modeled in (Caunegre et al 1996) and are also addressed here. Faults 
internal to operational amplifiers are not considered in this work. A brief 
description of the fault modeling is presented in the following. 

(a) Soft faults, large deviations and hard faults 

Soft faults are small deviations around the nominal value of a component. 
They may cause a circuit malfunction such as a change in the cut-off frequency 
of a filter, in the output gain of an amplifier, etc. 

Large deviations are also deviations in the nominal value of components, 
but of a greater magnitude (in general around 50%). They still cause circuit 
malfunctions but their effects are quite different from those observed for soft 
faults. 

Hard faults are serious changes in component values. These faults can even
modify the structure of the design and are usually due to interconnect or
component open and short circuits.



(b) Interaction faults 

A fault that often occurs in most board manufacturing technologies is the
short between nodes of the circuit. This kind of fault was modeled in (Caunegre
et al. 1996) but was not considered in previous automatic test generation tools.
Because its effect is strongly dependent on the circuit topology, there is no
test methodology that ensures its detection. Figure 1 shows an example of
this fault in an analog block. This work deals with pairwise shorts involving
circuit nodes that are not terminals of the same component.



Figure 1 Analog interaction fault modeled as a resistive short of 1 ohm



2.2 Sensitivity computation 

The circuit sensitivity is defined as the effect on a performance parameter
$T_j$ of a deviation in an element $x_i$ of the circuit. This relative deviation can
be expressed in different ways depending on its magnitude. The differential
sensitivity is applied to small deviations (soft faults) only, while the
incremental sensitivity is used for small and large deviations (including hard
faults). Then, considering an output parameter $T_j$ and an element $x_i$, we have:

1. Differential sensitivity:

$$S^{T_j}_{x_i} = \frac{x_i}{T_j}\,\frac{\partial T_j}{\partial x_i} \qquad (1)$$

2. Incremental sensitivity:

$$\rho^{T_j}_{x_i} = \frac{x_i}{T_j}\,\frac{\Delta T_j}{\Delta x_i} = \frac{S^{T_j}_{x_i}}{1 + S^{D}_{x_i}\,\dfrac{\Delta x_i}{x_i}} \qquad (2)$$

where $S^{D}_{x_i}$ is the differential sensitivity of the denominator of $T_j$.
Sensitivity analysis can be made experimentally (Ayari et al 1995) or sym- 
bolically (Slamani et al 1995). Herein, the symbolic calculation was chosen. 
This way, it is possible to optimize the test generation process and obtain 
more information about a faulty analog circuit. 
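
The difference between the two sensitivities can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it uses a resistive voltage divider T = R2/(R1+R2) rather than the filter studied later, and it assumes the closed form "differential sensitivity divided by one plus the denominator sensitivity times the relative deviation", which holds when T is a ratio of expressions linear in the deviating element.

```python
def divider(r1, r2):
    """Transfer ratio of a resistive divider, used here as the parameter T."""
    return r2 / (r1 + r2)

def differential_sensitivity(r1, r2):
    """S = (x/T) dT/dx for x = r1; for the divider this is -r1/(r1+r2)."""
    return -r1 / (r1 + r2)

def incremental_sensitivity(r1, r2, rel_dev):
    """rho = (dT/T) / (dx/x), evaluated directly for a finite deviation of r1."""
    t0 = divider(r1, r2)
    t1 = divider(r1 * (1.0 + rel_dev), r2)
    return ((t1 - t0) / t0) / rel_dev

def incremental_from_differential(r1, r2, rel_dev):
    """Same quantity from the assumed closed form S_T / (1 + S_D * dx/x)."""
    s_t = differential_sensitivity(r1, r2)
    s_d = r1 / (r1 + r2)  # differential sensitivity of the denominator r1 + r2
    return s_t / (1.0 + s_d * rel_dev)

r1 = r2 = 1000.0
print(differential_sensitivity(r1, r2))            # small-deviation limit
print(incremental_sensitivity(r1, r2, 0.5))        # 50% large deviation
print(incremental_from_differential(r1, r2, 0.5))  # closed form, same value
```

For a 50% deviation the incremental value differs noticeably from the differential one, which is why large deviations and hard faults need the incremental form.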
Figure 2 Proposed test generation methodology



3 TEST GENERATION PROCEDURE 



Figure 2 presents the test generation procedure that we propose. Basically, the
procedure consists of a structural circuit analysis starting from the definition
of its transfer function, which makes a symbolic sensitivity analysis possible.
After the definition of the transfer function and its sensitivity analysis, a
minimal set of parameters and input signals that guarantee maximal fault
coverage is computed. This set must also consider interaction faults, so that
these faults can be detected just as component faults are.

Every step of the test generation procedure is further discussed in the fol- 
lowing: 



Step 1: Once the transfer function is defined, the symbolic analysis of the
sensitivity is carried out by means of equation (2).

Step 2: After the sensitivity computation, three variables defined by the
designer must be considered:



2.a) the test parameter tolerance, that corresponds to the maximal out- 
put deviation that is still considered correct. This information depends 
not only on the circuit specification, but also on the accuracy of the test 
equipment that will be used to measure the parameters. 

2.b) the minimal deviation to detect in each component. This depends 
on the accuracy of the components used in the design. 

2.c) the range of input signals to be considered. 



Then, two pieces of fault detection information are made available to the test
engineer: the minimal sensitivity from which the defined deviation will be detected,
and the minimal deviation of each component that can be detected with the
defined tolerance. With this, the test generation tool can show the designer
both the input ranges to be used to detect deviations above the minimal value
and the elements or faults that are outside the detection regions.



Step 3: The minimal set of detection parameters and input signals is com- 
puted as follows: 



3.a) For each test parameter and each component, a sensitivity-frequency 
curve is plotted for each kind of fault (soft, large, open and short). The 
resulting set of curves brings the information about the input ranges 
that detect every fault on a given component, considering deviations 
above and below the nominal values. 

3.b) In each set of curves, the most restrictive range (or set of ranges) is 
taken so that it is ensured that all faults on that component are detected. 

3.c) Properly combining (per test parameter) the set of ranges given by 
all component curves, a minimal set of test signals can be chosen so that 
maximal detection is achieved. 



Step 4: Then, the minimal deviations actually detected for each component 
are computed for the test signals obtained in the previous step. 

Step 5: The interaction faults are dealt with in this step. The ranges defined
in step 3.c are used to simulate these faults and evaluate their coverage. For
those interaction faults not detected by the existing test signals, new input
test stimuli are selected by comparing the electrical simulation of
the faulty circuit with the electrical simulation of the fault-free circuit along
the frequency range of interest.







4 EXPERIMENTAL RESULTS 

In order to validate the procedure discussed in section 3, it was applied to 
the biquad filter presented in figure 3. The results obtained in each step are 
explained below. 




4.1 Transfer function definition 



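The frequency-domain parameters that can be observed from the filter
primary output are the gain and phase associated with node 7. The gain (G) and
the phase ($\Phi$) parameters are defined as:

$$G = \left| \frac{R_1 R_4 R_d}{\left(4 R_3 R_2 C_2 \pi^2 f^2 R_1 R_d C_1 - 2j R_3 R_2 C_2 \pi f R_1 - R_4 R_d\right) R_g} \right| \qquad (3)$$

$$\Phi = \arctan\left( \frac{1}{2}\, \frac{4 R_g R_3 R_2 C_2 \pi^2 f^2 R_1 R_d C_1 - R_g R_4 R_d}{R_g R_3 R_2 C_2 \pi f R_1} \right) \qquad (4)$$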



4.2 Sensitivity computation 

Using the definitions of the sensitivity presented in section 2.2 (equation 2) 
and the gain and phase equations (equations 3 and 4), the computation of 
the sensitivity was symbolically performed using a powerful tool for algebraic 
manipulation (MapleV 1991). 

The parameter tolerances and minimal component deviations were defined
as 5% (steps 2.a and 2.b in section 3) and the input range (frequency range)
as [0 Hz..5000 Hz] (step 2.c).



4.3 Selection of the minimal set of parameters and input
signals

The incremental sensitivity curves were plotted for each test parameter and
each component considering the predefined frequency range. Two gain
sensitivity curves for Rd are shown in figure 4.
[Plot panels: Rd sensitivity for soft faults - Gain; Rd sensitivity for open - Gain]

Figure 4 Gain sensitivity for Rd soft deviations and open circuit, respectively.



These curves include a straight line corresponding to the minimal sensitivity
value that ensures the detection of the component deviations. This value is
given by the expression $sens_{min} = tolerance / deviation$ (steps 2.a and 2.b).

From all the curves plotted for all components, the test frequency ranges
were computed and are presented in tables 1 and 2 for the parameters gain
and phase.
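
Reading a detection range off a sensitivity curve can be sketched as follows. This is a hedged illustration only: the sampled curve is invented, the threshold is assumed to be the ratio of parameter tolerance to component deviation (5% / 5% here), and the tool described in this paper works on symbolic expressions rather than on samples.

```python
def detection_ranges(freqs, sens, sens_min):
    """Frequency ranges (lo, hi) where |sensitivity| >= sens_min on a sampled curve."""
    ranges, start, last = [], None, None
    for f, s in zip(freqs, sens):
        if abs(s) >= sens_min:
            if start is None:
                start = f
            last = f
        elif start is not None:
            ranges.append((start, last))
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges

# Assumed threshold: tolerance / deviation, both set to 5% in this example.
sens_min = 0.05 / 0.05

# Hypothetical samples of one incremental sensitivity curve (frequency in Hz).
freqs = [0, 500, 1000, 1500, 2000, 2500]
sens = [0.2, 1.1, 1.4, 0.8, -1.2, -1.0]

print(detection_ranges(freqs, sens, sens_min))
```

The absolute value is used because a deviation is detectable whether it pushes the parameter up or down; a curve that never reaches the threshold yields an empty list, which corresponds to a "-" entry in the tables.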

Table 1 Test frequency ranges (Hz) to detect each kind of fault using the test
parameter gain

Component  Soft           Large                       Hard faults
           faults         deviations                  Short          Open
Rg         [0..5000]      [0..5000]                   [0..5000]      [0..5000]
C1         [1825..5000]   [270..745] or [845..5000]   [1.5..5000]    [192..5000]
Rd         -              [300..2200]                 [1.5..5000]    [200..3724]
R1         -              [0..737] or [887..1925]     [0..5000]      [0..3438]
R2         [850..3375]    [437..5000]                 [362..5000]    [35..5000]
R3         [850..3362]    [437..5000]                 [355..5000]    [35..5000]
R4         [825..5000]    [487..5000]                 [1.6..5000]    [375..5000]
C2         [850..3400]    [440..5000]                 [45..5000]     [361..5000]

From these tables, the following can be concluded:

1. Deviations in Rg will affect only the test parameter gain. No fault in this
component generated a phase sensitivity curve where a frequency range
could be selected.

2. For the phase parameter, the minimal sensitivity necessary to detect soft
faults was not achieved for any component. This indicates that the design
is robust for this parameter, which means that only deviations above the
component accuracy will be signaled as actual faults. Looking at the gain
parameter, only Rd and R1 have this feature.



Then, the minimal set of frequency ranges that guarantees maximal cover- 
age is obtained from the ranges shown in tables 1 and 2 for the parameters 
gain and phase: 



• Gain: [1825..1925]Hz 

• Phase: [475..722]Hz or [865..2512]Hz considering that faults in Rg are not 
detected. 
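
The gain range quoted above can be reproduced by intersecting the per-fault ranges of table 1. The sketch below is an illustration, not the tool's implementation: each range is represented as a union of closed intervals in hertz, and entries with no detection range (soft faults in Rd and R1) are left out, as they cannot be covered by any frequency.

```python
def intersect(a, b):
    """Intersect two unions of closed intervals, each a list of (lo, hi) pairs."""
    out = []
    for lo1, hi1 in a:
        for lo2, hi2 in b:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo <= hi:
                out.append((lo, hi))
    return sorted(out)

FULL = [(0, 5000)]  # whole frequency range; also the Rg entries of table 1

gain_ranges = [
    [(1825, 5000)], [(270, 745), (845, 5000)], [(1.5, 5000)], [(192, 5000)],  # C1
    [(300, 2200)], [(1.5, 5000)], [(200, 3724)],                              # Rd
    [(0, 737), (887, 1925)], [(0, 5000)], [(0, 3438)],                        # R1
    [(850, 3375)], [(437, 5000)], [(362, 5000)], [(35, 5000)],                # R2
    [(850, 3362)], [(437, 5000)], [(355, 5000)], [(35, 5000)],                # R3
    [(825, 5000)], [(487, 5000)], [(1.6, 5000)], [(375, 5000)],               # R4
    [(850, 3400)], [(440, 5000)], [(45, 5000)], [(361, 5000)],                # C2
]

common = FULL
for r in gain_ranges:
    common = intersect(common, r)

print(common)  # [(1825, 1925)]
```

The lower bound comes from the soft-fault range of C1 and the upper bound from the large-deviation range of R1, so these two entries are the ones that constrain the choice of test frequency.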

Table 2 Test frequency ranges (Hz) to detect each kind of fault using the test
parameter phase

Component  Soft     Large                       Hard faults
           faults   deviations                  Short                       Open
Rg         -        -                           -                           -
C1         -        [475..5000]                 [2.75..5000]                [412..5000]
Rd         -        [240..722] or [865..5000]   [130..755] or [832..5000]   [0..5000]
R1         -        [220..2525]                 [127..3500]                 [0..5000]
R2         -        [212..2525]                 [125..3520]                 [0..5000]
R3         -        [220..2525]                 [125..3525]                 [0..5000]
R4         -        [232..2520]                 [125..3488]                 [0..5000]
C2         -        [222..2512]                 [125..3475]                 [0..5000]

For this example and for the gain parameter, a single input frequency is 
sufficient to detect faults in all components. For the phase parameter, a single 
frequency can also be chosen, but faults in Rg will never be detected. It means 
that the gain is the best test parameter and any frequency within the selected 
range can be used as a test frequency. For the sake of simplicity, the test 
generation tool chooses a frequency in the middle of the range. 

Other circuits may need more input stimuli to ensure maximal detection of 
faults. Since it is desirable to decrease the number of input signals necessary 
to test, the gain is the parameter selected for fault detection in the filter 
components. 

The next step consists of computing the minimal deviations actually
detected by each parameter at the circuit output. This step also reveals
which faults cannot be detected by this method. The sensitivity values
of each parameter for every component at the boundaries of the test frequency
ranges are computed. Table 3 shows the minimal deviations obtained for the
test parameter gain.
Table 3 Minimal component deviations detected by the filter gain

Component  Nominal   1825Hz     1875Hz     1925Hz
           value
Rg         10        -4.76%     -4.76%     -4.76%
                     +5.26%     +5.26%     +5.26%
C1         20e-9     -4.97%     -4.95%     -4.93%
                     +5.50%     +5.47%     +5.45%
Rd         10K       -18.98%    -19.94%    -20.91%
                     +26.89%    +29.09%    +31.43%
R1         10K       -20.74%    -21.56%    -22.39%
                     +40.68%    +43.63%    +46.81%
R2         10K       -4.03%     -4.06%     -4.09%
                     +4.45%     +4.49%     +4.52%
R3         10K       -4.03%     -4.06%     -4.09%
                     +4.45%     +4.49%     +4.52%
R4         10K       -4.26%     -4.29%     -4.32%
                     +4.20%     +4.23%     +4.26%
C2         20e-9     -4.03%     -4.06%     -4.09%
                     +4.45%     +4.49%     +4.52%

From table 3 the following remarks can be made:

1. The elements that do not have a detection range in table 1 present, in table
3, a minimal deviation greater than the minimum defined by the designer
(5%).

2. For the other components checked by the gain parameter, deviations smaller
than 5% will be detected and labeled as faults. It means that the design is
not sufficiently robust when considering these elements and this parameter.
All these data are made available to the designer by the test generation
tool, and the designer can thus choose the best solution: either change the
design, or keep the test procedure despite the possible false rejection of parts.
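
The Rg entries of table 3 can be checked by hand, since Rg multiplies the whole denominator of the gain expression (equation 3) and the gain is therefore inversely proportional to Rg at every frequency. The short Python sketch below works under that assumption and a 5% gain tolerance; the function name is introduced here for illustration only.

```python
def min_detectable_deviation_inverse(tolerance):
    """For a parameter proportional to 1/x, return the (negative, positive)
    relative deviations of x that shift the parameter by exactly the tolerance."""
    neg = 1.0 / (1.0 + tolerance) - 1.0  # x decreases, parameter rises by tolerance
    pos = 1.0 / (1.0 - tolerance) - 1.0  # x increases, parameter falls by tolerance
    return neg, pos

neg, pos = min_detectable_deviation_inverse(0.05)
print(f"{neg:+.2%} {pos:+.2%}")  # -4.76% +5.26%
```

The asymmetry between the negative and positive limits follows from the inverse relation alone, which also explains why the Rg values do not change across the three test frequencies.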



4.4 Coverage of interaction faults 

In order to check the coverage of interaction faults, electrical simulations were
performed using HSpice and detection ranges were determined. The results of
these simulations are presented in table 4.

The first column indicates the fault simulated (1-3 is a short between nodes
1 and 3). The second column gives the detection status for the chosen test, and
the third and fourth columns show the detection ranges, in hertz, for gain and
phase, respectively. oo denotes values above 100 kHz.

From table 4, it can be noted that the coverage of interaction faults for
the test parameter and test frequency chosen (gain at 1875 Hz) is 10 out of 14
possible faults (71.4%).
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Table 4 Detection ranges for interaction faults for each test parameter 



Fault 


Detection 

status 


New gain 
range 


New phase 
range 


1-3 


detected 


- 


- 


1-4 


detected 


- 


- 


1-5 


detected 


- 


- 


1-6 


detected 


- 


- 


1-7 


detected 


- 


- 


2-4 


detected 


- 


- 


2-5 


detected 


- 


- 


2-6 


undetected 


[490..1.86K] or [2K..oo] [147.. oo] 


3-5 

3-6 


undetected 

detected 


[3K..OO] 


[1.48K..OO] 


3-7 


undetected 


[0..1.8K] or [2K..OO] 


[758..0O] 


4-6 


detected 


- 


- 


4-7 


detected 


- 


- 


5-7 


undetected 


[24.5K..OO] 


[1.1K..OO] 



New ranges allowing the detection of the four undetected faults were deter- 
mined by electrical simulation of the faulty circuit along the whole frequency 
range. 

The individual new ranges are shown in the third and fourth columns of 
table 4. By intersecting these individual ranges, one obtains the following 
global ranges for the gain and the phase parameters: 

• Gain: [24.5K..∞] Hz

• Phase: [1.48K..∞] Hz

Note that the new range obtained for the gain requires very high frequencies,
for which the output gain is extremely low, since the filter has a low-pass
function. Therefore, the test parameter to choose for improving the detection
of interaction faults is the phase.

The parameters to observe and the final detection ranges for this
circuit are then:

• Gain: [1825..1925] Hz, for the detection of component faults.

• Phase: [1.48K..5K] Hz, for the detection of interaction faults.
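
The intersection step that yields the global ranges can be sketched as follows. This is an illustrative sketch, not the program the authors mention as being under development: each range of Table 4 is written as a union of closed intervals in hertz, and the helper names `intersect` and `global_range` are assumptions.

```python
import math
from functools import reduce

INF = math.inf  # stands for the "above 100 kHz" entries of Table 4

# New detection ranges (Hz) of the four undetected shorts, copied from Table 4.
GAIN = {
    "2-6": [(490, 1.86e3), (2e3, INF)],
    "3-5": [(3e3, INF)],
    "3-7": [(0, 1.8e3), (2e3, INF)],
    "5-7": [(24.5e3, INF)],
}
PHASE = {
    "2-6": [(147, INF)],
    "3-5": [(1.48e3, INF)],
    "3-7": [(758, INF)],
    "5-7": [(1.1e3, INF)],
}


def intersect(a, b):
    """Intersect two unions of closed intervals, each a list of (lo, hi)."""
    result = []
    for lo1, hi1 in a:
        for lo2, hi2 in b:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo <= hi:  # keep only non-empty overlaps
                result.append((lo, hi))
    return sorted(result)


def global_range(ranges):
    """Frequencies at which every listed fault is detected."""
    return reduce(intersect, ranges.values())


if __name__ == "__main__":
    print("gain :", global_range(GAIN))
    print("phase:", global_range(PHASE))
```

Running the sketch on the Table 4 data gives a single interval starting at 24.5 kHz for the gain and at 1.48 kHz for the phase, matching the global ranges quoted above.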

5 COMPUTER-AIDED TESTING TOOLS 

The test methodology proposed in sections 3 and 4 is based on the possibility 
of optimizing and automating the test process in order to lower testing costs. 
Some commercial tools used in this work for symbolic computation (Maple V) 
and for electrical simulation (HSpice) were mentioned in previous sections.

For the transfer function definition, tools such as SSpice and XFUNC, both
running in a DOS environment, can be used. They receive a structural Spice-like
description of the circuit and generate the corresponding transfer function.

The sensitivity computation is carried out on the transfer function using
standard mathematical definitions. Any tool capable of differential calculus
and of plotting curves can therefore be used to support the test generation
process. This work used Maple V, a powerful tool available in-house that runs
in a UNIX environment, although there are commercial versions for WINDOWS.
MATHCAD is another possibility that can be used in a WINDOWS environment.
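
As a rough illustration of what such a sensitivity computation produces, the sketch below evaluates the normalised sensitivity (x/T)·dT/dx of a gain T with respect to a component value x. It is only a stand-in under stated assumptions: it uses a central finite difference instead of the symbolic differentiation done in Maple V, and a first-order RC low-pass instead of the biquadratic filter of the case study. The component values and the function names are hypothetical.

```python
import math


def gain(params, freq_hz):
    """|H(j*2*pi*f)| of a first-order RC low-pass (stand-in circuit)."""
    w = 2 * math.pi * freq_hz
    return 1.0 / math.sqrt(1.0 + (w * params["R"] * params["C"]) ** 2)


def relative_sensitivity(func, params, name, freq_hz, rel_step=1e-6):
    """Normalised sensitivity (x / T) * dT/dx by central finite difference."""
    x = params[name]
    h = x * rel_step
    up = dict(params, **{name: x + h})
    down = dict(params, **{name: x - h})
    derivative = (func(up, freq_hz) - func(down, freq_hz)) / (2 * h)
    return x / func(params, freq_hz) * derivative


if __name__ == "__main__":
    params = {"R": 10e3, "C": 20e-9}  # hypothetical values
    f_c = 1.0 / (2 * math.pi * params["R"] * params["C"])  # cut-off frequency
    print(relative_sensitivity(gain, params, "R", f_c))
```

For this stand-in circuit the analytic result is -(ωRC)²/(1 + (ωRC)²), i.e. -0.5 at the cut-off frequency, which gives a quick sanity check of the numerical derivative.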

The tools briefly described above can generate the first detection ranges that
will be used to determine the coverage of interaction faults. For this analysis,
an electrical simulator is needed. HSpice, a standard simulator running in
UNIX, was used, since the first structural description was made in its input
language. This tool is also available in a DOS environment.

In order to fully automate the test process, some additional tools are also
needed. They will interface the commercial tools in use by way of parsers.
In addition, a program that determines the coverage of interaction faults
and defines new ranges for undetected faults is essential and is now under
development.



6 CONCLUDING REMARKS 

This work has presented a test generation methodology for analog circuits.
A sensitivity-based test generation tool was proposed that can automatically
generate a minimal set of test signals that guarantees maximum fault
detection. The shorter the test set, the cheaper the test process, since fewer
test stimuli must be applied to the circuit under test and fewer test
parameters must be measured. The fault model used considers interaction shorts
in addition to soft and large component deviations, and hard faults (opens and
shorts).

The test methodology was illustrated by means of a case study, a biquadratic
low-pass filter. The test parameters considered were the gain and the phase
of the filter. The gain proved to be the best parameter for detecting faults
in components, and the phase the best for detecting interaction shorts. The
method was also applied to a non-linear circuit, a voltage-controlled oscillator
(555-like oscillator). In this case, the test parameter measured was the output
frequency. Two values for the voltage control (2.5V and tri-state) were enough
to detect every fault in the model.

Future work includes, but is not limited to, the full automation of the test
generation process and its extension to the diagnosis of both component and
interconnect faults.



Abstract

A multi-chip module implementation is shown which provides new features and 
conditions for mixed-signal boundary-scan testing of MCMs. Assembled non 
boundary-scan mixed-signal bare dies and integrated passive devices can be tested 
using active-substrate DfT circuits. The MCM-D is very well suited for electron- 
beam diagnosis. 

Keywords: Mixed-signal/analogue testing, MCM testing, P1149.4 boundary- 
scan, e-beam testing 



1. INTRODUCTION 

In the past years, multi-chip module (MCM) techniques, which were previously
only used in expensive high-speed computers or military applications, have
begun to find their way into high-volume consumer microsystems.
Examples of their use can be found in high-end televisions, mobile telephones or a
compass wristwatch (Tangelder, 1997).

A cost-effective approach to testing PCBs makes use of digital boundary-scan
(BS) techniques as specified in the IEEE standard 1149.1 (Bleeker, 1993). For
the past six years, work has progressed on an extension of the 1149.1 standard
towards the 1149.4 mixed-signal boundary-scan standard (Sunter, 1996).

In recent years, digital BS techniques have also been introduced for testing MCMs
(Zorian, 1992). In practice, unfortunately, not all bare dies on an MCM substrate
incorporate BS cells. This holds for digital BS cells and, in the current absence of a
1149.4 standard, certainly for analogue BS cells. This reduces the effectiveness of
the boundary-scan approach in a microsystem, especially with regard to the detail
of diagnosis.

In this paper, an implementation is shown of a silicon-substrate based MCM-D.
All interconnection wiring and passive as well as active devices can be integrated
in this substrate. Based on this implementation, different constraints apply in terms
of the requirements of mixed-signal BS hardware as compared to a PCB
environment. As a result, a selected set of new BS cells has been designed based
on our MCM-D design rules. The associated test-control hardware has also been
designed. This set of DfT circuits provides excellent testing possibilities for
MCM-Ds.

Although only briefly discussed in this paper, the completely planar structure of 
the MCM-D implementation is very suitable for electron-beam diagnostics of 
complete MCMs. 



2. PLANAR MCM-D IMPLEMENTATION 

In order to implement a high-density MCM for a microsystem (Tangelder, 
1996), a test MCM-D has been designed, implemented and evaluated to 
investigate its potentials. The manufacturing process has not been presented 
before, although some elements can be found in other publications (Hopper, 1992). 

The MCM-D substrate used is p-type silicon. It forms the basis for a 
conventional CMOS process flow, currently employing two metal levels and two 
poly-silicon layers. The latter are used to implement passive devices (R, C). There 
is however a difference between the MCM-D process and our conventional CMOS 
process with regard to the minimum dimensions used. This has to do with the 
direct-contact (1:1) MCM-D lithography. 

The interconnections on the substrate are established by the first and second 
metal, except for the bridge from the substrate to the bare dies. These bridges, 
implemented in third metal, replace the relatively error-prone wire-bondings at the 
cost of an additional IC mask. 

A cross-view drawing of the substrate with an embedded bare die is shown in
Figure 1. A photomicrograph of part of such an MCM is seen in Figure 2. The overall
dimensions of this MCM are 210mm by 220 mm. For the sake of clarity, one hole
for a guest bare die has been left open, while the other one is populated with a 50k
digital Sea-of-Gates chip.



Figure 1 Cross-view of the populated MCM-D




Figure 2 Photomicrograph of the partly completed MCM 



The substrate is covered on both sides with silicon nitride acting as a "hole"
mask. Next, holes are anisotropically etched in the substrate.
Therefore the hole in front is slightly larger than the dimensions of the guest bare
die, while the back-end is larger. This is favourable for the thixotropic paste to
mount the bare die within the substrate.



3. MIXED-SIGNAL BOUNDARY-SCAN IN MCM-DS 

The MCM-D environment is different from the one in a PCB with regard to 
boundary-scan testing. This has a direct impact on the electrical requirements of 
the mixed-signal (MS) BS cells. As previously discussed, the design rules for these
MCM-Ds are less aggressive than those in current advanced CMOS processes,
which influences the BS-cell capabilities. All these constraints resulted in the
design of dedicated mixed-signal BS cells in our MCM-D process.



3.1 Mixed-Signal BS in active-substrate MCM-Ds 

Although criticised by several semiconductor companies and not yet an IEEE
standard, the current P1149.4 draft provides sufficient information for designing
and evaluating several realisations of the suggested hardware (Lofstrom, 1996).

In many libraries of semi-custom semiconductor manufacturers, digital
boundary-scan cells are now available, including the associated test control such as
the TAP controller (Koopman, 1992). However, a full range of bare dies for
MCMs which are completely boundary-scan testable (1149.1/P1149.4) is not
available at present, and it remains questionable whether this will ever be achieved.

A full boundary-scan bare-die population is not a prerequisite for the BS concept 
in MCMs. Furthermore, the research on testing non-BS clusters is progressing 
(Lubaszewski, 1994), but the testability of separate bare dies and passive 
components within the clusters will remain limited. In many cases this is not 
acceptable, as several (potentially good) bare dies have to be removed, replaced 
and rewired for often one faulty bare die. 

We compensate for this limitation by introducing digital as well as analogue BS 
cells around the non-BS bare dies in the substrate of the MCM. To use the active 
substrate of an MCM for introducing digital BS cells in the case of non-boundary- 
scan cells has already been presented earlier (Maly, 1994). For mixed-signal 
MCMs, no investigations have been carried out yet. 

It can be calculated (Oliver, 1996) that in several cases it is even more
cost-effective to implement BS cells in the active substrate instead of in the bare dies.
In the extreme case, the bare dies can be glued on top of the active-substrate BS
cells, and subsequently interconnected.

For the sake of simplicity, Figure 3 shows a symbolic representation of an
example of an MCM-D with two non-boundary-scan bare dies and a passive
component Z. Furthermore, each bare die is assumed to have analogue and digital
parts. In a practical situation, more passive components and several analogue as
well as digital bare dies are involved, of which some digital ones will feature
digital boundary-scan facilities.

The black boxes in the bare dies are the bonding pads, which are connected to
digital I/O buffers (D-I, D-O) and analogue input and output buffers (A-I, A-O). In
our MCM-D approach, these bare-die bonding pads are connected to the BS cells
via small metal bridges (see Figure 1). Bare-die digital I/O can be connected to
bidirectional digital BS cells (DBC = Digital Boundary Cell) and analogue inputs
and outputs to bidirectional analogue BS cells (ABC = Analogue Boundary Cell).
It is of course also possible to design and use unidirectional cells.

In both cases, additional ESD protected buffering (grey triangles in Figure 3) is 
required for BS cells connected to MCM input and output bonding pads. The black 
boxes in the MCM-D substrate are the MCM bonding pads to be used for wire 
bonding to the ceramic PGA package. 




Figure 3 Example of active BS architecture in an MCM-D 

Besides the MCM system digital input (DI) and output (DO) and their analogue 
counterparts (AI, AO), also bonding pads are incorporated for BS testing purposes. 

For simplicity, the BS control pads (TDI, TDO, TMS, TCK, TRST*) have not
been connected to the test controllers (TCON) in Figure 3. The internal TDO of the
first controller is connected to the internal TDI of the second one. Hence, the
shown TDO and TDI pads are the external ones.

A TCON consists of an extended 1149.1 TAP controller, instruction register and
decoder. The external analogue test bus (AT1, AT2) is connected to the Analogue
Test Bus Control (ATBC), which is the circuit controlling the routing between the
external bus and internal bus (AB1, AB2). It is possible to share parts of the
hardware in the TCONs and ATBCs.

Figure 3 shows that the concept of Known Good Substrate (KGS) testing
requires more effort in our MCM-D case than in other approaches. It basically
requires mixed-signal wafer testing. For this reason, bridging pads, ESD
protection and some form of buffering are also required at the "bridge" side of the
active-substrate BS cells because of substrate test handling.



3.2 Requirements of BS cells in MCM-Ds 

The mixed-signal BS cells were primarily designed for application in our type of 
sensor-based microsystems (Tangelder, 1997), which generally feature low 
frequencies (<10MHz) and modest voltage and current levels. Hence, the 
requirements of the analogue switches in the ABCs (Sunter, 1996) and (Lofstrom, 
1996) are relatively modest. The use of BS in MCM-Ds in the current approach 
requires several types of buffers for different purposes: 

1. ESD-protected, full-size buffers for the MCM-D analogue and digital inputs
and outputs, able to handle electrical requirements at the PCB level. Standard
CMOS bond-pad buffering circuits can be used.

2. ESD-protected, medium-size input and output buffers between bridges and
BS cells to enable KGS testing with active probes (0.1 pF, I_LEAK < 0.1 pA).

3. Non-ESD-protected, small output buffers to interconnect BS cells from
different bare dies on the substrate.

In principle, buffer 1 can do all jobs but this is rather inefficient in terms of area
and power consumption. The buffer 1 type is located around the MCM bonding
pads.

Buffer 2 is integrated in the BS cell together with a number of switches, such as
the ones for "core disconnect" and the ones to the V+ and V- in the analogue case
(Sunter, 1996), (Lofstrom, 1996). Buffer 2 can also do the job of buffer 3.

Buffer 3 is also integrated in the BS cell.

As Figure 3 indicates, the location of BS cells differs in the MCM-D approach
from PCBs and "conventional" MCMs. In the case of an active substrate, the BS
cells are equally distributed around the bare die.

In an MCM-D, the analogue or digital cores as specified in the standard (1149.1)
and the draft (P1149.4) are only accessible via ESD-protected I/O buffers. In most
(analogue) cases, these have no facility to be put in an off-mode state, and hence
the conceptual switch (core disconnect) (Sunter, 1996 and Lofstrom, 1996) should
be incorporated in the BS cells.

Another difference of the MCM-D with a PCB has to do with the passive 
components. Whereas in PCBs missing components and wrong component types 
are reality, this situation is very unlikely to occur in MCM-Ds. The passive 
components (as well as the active) are evaluated during the Known-Good Substrate 
(KGS) tests. Hence, BS facilities for evaluating passive devices are rarely used. 

As a result of our implementation technique, other design considerations also
emerged for the BS cells. It was assumed that the dimensions of the bare-die
bonding pads are basically governed by the required mechanical spacing of
conventional wire-bonding techniques. The minimum size to be expected is around
80 µm by 80 µm and the minimum spacing is 100 µm. In our concept this does not hold
for the bridging pads in the substrate. As it is still required to be able to carry out
"Known-Good Substrate" tests, especially if active devices are incorporated, the
bridging pads have to be around 25 µm by 25 µm to enable the use of active
microprobe access. In the case of conventional wafer probing, bridging pads of
80 µm by 80 µm are required. We have chosen the latter option in this first test.
The distance between these bridging pads is governed by the spacing of the
bare-die bonding pads (~100 µm). The bridging pads are integrated in the BS cells. The
distance of the BS cells from the etched holes (Figure 1) is 200 µm + 20 µm =
220 µm, where the first value accounts for the slope and the second one for the
distance between the mask of the hole and the cell. With all the above data, the
architecture, pitch and position of the BS cells are now fixed. To reduce the
substrate wiring, the test-control circuits are located near the bare dies.



3.3 Design of BS-cells in active substrate 

Although digital BS cells and the associated TAP controller have been designed
in the past at our institute using the Sea-of-Gates design methodology (Koopman,
1992), a full-custom redesign had to be made in the MCM-D process. The design
of the selected BS cells was implemented such that abutment of neighbouring
cells is possible. If the distance between the bare-die bond pads is larger than the
minimum, a flexible metal-interconnection cell of variable width can be
automatically inserted between neighbouring BS cells. These interconnection cells
have to be transparent to enable incoming and outgoing signals which do not
require BS cells, such as the power-supply lines.

The DBC has been designed according to the 1149.1 standard (e.g. Bleeker, 
1993), employing a D-Flipflop, a D-latch and two MUXs. In the design they are 
combined with buffers. 
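
The structure named above (a D-flipflop, a D-latch and two MUXs) can be illustrated with a small behavioural model. This is a sketch of a generic 1149.1-style boundary cell, not of the full-custom layout described here; the class name, the method names and the signal names (`shift_dr`, `mode`) are illustrative assumptions, and the buffers are not modelled.

```python
from dataclasses import dataclass


@dataclass
class DigitalBoundaryCell:
    """Behavioural sketch: input MUX, capture flip-flop, update latch, output MUX."""

    capture_ff: int = 0    # D flip-flop in the serial shift path
    update_latch: int = 0  # D-latch holding the value driven in test mode

    def clock_dr(self, parallel_in: int, serial_in: int, shift_dr: bool) -> int:
        """Clock the flip-flop; the input MUX picks serial or parallel data.

        Returns the bit leaving the cell towards the next cell (or TDO).
        """
        serial_out = self.capture_ff
        self.capture_ff = serial_in if shift_dr else parallel_in
        return serial_out

    def update_dr(self) -> None:
        """Copy the shifted value into the update latch."""
        self.update_latch = self.capture_ff

    def parallel_out(self, parallel_in: int, mode: bool) -> int:
        """Output MUX: test value when mode is set, otherwise transparent."""
        return self.update_latch if mode else parallel_in


def shift_chain(cells, bits):
    """Shift bits into a chain (first cell nearest TDI); return the TDO bits."""
    tdo = []
    for bit in bits:
        carry = bit
        for cell in cells:
            # Each cell takes the previous value of its predecessor.
            carry = cell.clock_dr(parallel_in=0, serial_in=carry, shift_dr=True)
        tdo.append(carry)
    return tdo


if __name__ == "__main__":
    chain = [DigitalBoundaryCell() for _ in range(3)]
    shift_chain(chain, [1, 0, 1])
    for cell in chain:
        cell.update_dr()
    print([cell.parallel_out(parallel_in=0, mode=True) for cell in chain])
```

In normal operation (`mode` false) the cell is transparent, so the inserted cells only take over the pin values when a test is running.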

The ABC uses the same type of flipflops, latches and MUXs as the DBCs. The 
difference lies in additional control logic for the analogue switches, and of course 
the requirements of the different switches. A detailed scheme of the 5-switch ABC 
can be found in (Sunter, 1996) and (Lofstrom, 1996). One can recognise a sliced 
architecture in this cell. A symbolic scheme of such a slice is shown in Figure 4. 

The switches have different capabilities and implementations. The core
disconnect and V+ and V- have been integrated in the "bridge" buffers, which are
required for KGS testing anyhow. The other switches have been implemented on
the basis of transmission gates with different widths. As stated before, our
application area does not require very high performance switches.

Figure 4 Control slices and transmission-gate switches and switched buffers in
the ABC

The transistor gate dimensions used in the MCM-D process are 3 µm by 15 µm for
the NMOS and 3 µm by 45 µm for the PMOS transistors. The boundary-scan cell
slice comprises an inverter, D-Flipflop, D-latch and simple logic. Including an
analogue transmission-based switch, it measures 180 µm by 180 µm and
incorporates 30 transistors. The layout of this type of boundary-scan cell slice in
our MCM-D process is shown in Figure 5.

The extended TAP controller has been automatically synthesised using the
"Compass logic synthesiser", employing custom-made logic primitive blocks
(Inverter, NAND, NOR and FlipFlop) in our MCM-D process. After automatic
placing and routing, the layout required an overall area of 1050 µm by 1535 µm.

The ATBC is implemented with the same slices as discussed previously. The
TCON and ATBC are located near the bare-die BS cells.




Figure 5 Layout of the Boundary-Scan cell slice in our MCM process 



4. INITIAL RESULTS 

As could be expected from the relatively large minimum dimensions dictated
by the MCM-D process, there were speed limitations in terms of correct operation.
Above 30 MHz, the signals in the BS cell slice deteriorated to the extent that
correct operation of the cells could no longer be guaranteed. This also holds for
the test control parts. The resistance values of the analogue switches have an
influence on the analogue testing capabilities in the BS concept. The difference in
"on" and "off" resistance of the designed transmission gates was limited, to a
ratio of around 100; this can be increased by enlarging the width at the cost of
speed, or by incorporating an operational amplifier at the cost of increased active
area.



5. ADVANTAGES FOR ELECTRON-BEAM DIAGNOSIS 

Although the practice of using electron-beam probers for testing planar MCM 
substrates has been shown before by using charge / discharge techniques 
(Brunner, 1993), the suggested MCM implementation in this paper has a number 
of advantages in comparison with conventional MCM implementations in electron- 
beam testing of the entire MCM. By this is meant the populated substrate, 
including the bare dies. 

In the conventional case, thick (~800 µm) bare dies are mounted on top of a
substrate, hence causing focusing problems when the beam switches between
substrate and bare dies. In addition, high-rising bonding wires around the bare dies
can create disturbances of the beam (generated magnetic fields). In our completely
planar MCM these problems cannot occur, hence easing the road to
electron-beam diagnosis of complete MCMs (including bare dies). For the analogue parts,
there are of course clear limitations in achievable sensitivity and frequency range.

First diagnostic experiments with an electron-beam prober have started using an 
extended wafer stage for MCMs able to accommodate MCM sizes up to 8 * 8 cm. 
Subsequent repair of planar MCM-Ds is only confined to repair of metallisation 
tracks on the substrate, bridges and bare dies. Our experience shows that this can 
be successfully accomplished with focused ion-beam systems if required. 



6. CONCLUSIONS 

In this paper, a design has been made of complementary mixed-signal boundary- 
scan cells in the active substrate of an MCM-D. Specific requirements have been 
presented to support this environment. The current implementation of the 
boundary-cells is larger in area and slower in speed than could be achieved in an 
advanced IC process. It is stressed that the active devices are only used to 
implement boundary-scan cells for non-boundary-scan bare dies and associated 
registers and controllers, in order to keep the occupied active area and hence yield 
of the substrate to acceptable levels. However, it makes a completely mixed-signal 
boundary-scan based testing approach feasible in complex MCM-D microsystems. 
The planar structure is also very well suited to provide electron-beam diagnosis of
the substrate and bare dies at the same time.



Abstract

With the increasing complexity of VLSI circuits, reliability has been of great 
concern to the semiconductor and system manufacturers. The existing test 
methodologies, e.g. application of test set for stuck-at faults, are good for 
ensuring that the circuit is operational at the time of testing. Due to imper- 
fections in manufacturing process, some circuits have infant mortality prob- 
lems, which implies that the circuits have a high probability of failure during 
the early part of its useful life. Burn-in, which stresses the circuit under con- 
trolled environments, has been traditionally used to weed out circuits with 
infant mortality problems. However, some practical considerations such as 
cost or unavailability of burn-in ovens may limit the use of burn-in. If the 
cost of burn-in is prohibitive, then there are no known alternatives for im- 
proving reliability of the shipped circuits. In this paper we propose a stress 
testing method which can provide an attractive low-cost alternative to
burn-in. Stress testing can also be applied at the die level to generate Known Good
Dies (KGD) for Multi-Chip Modules (MCMs). The test methodology can
generate electrical or current stress in a circuit using either of the following
methods: (1) by generating high current density in all parts of the
circuit, or (2) by generating high current density in selected parts of the circuit,
resulting in hot spots and thermal gradients. Since stress testing should achieve
high fault coverage to cover functional faults as well, one clever idea is to use
a stuck-at fault test set, and modify it to perform the stress test.

Keywords 

Testing, Burn-in, Power dissipation 



1 INTRODUCTION 

Electronic integrated circuits (ICs) are increasingly found in new application
domains, for example, medical and automotive applications. Such use is
possible due to miniaturization through the advancement of technology, resulting
in higher performance, lower cost, and increased complexity of ICs.
Application domains in which a failure can have catastrophic effects require circuits
with higher quality and reliability. Even the best design and fabrication
techniques can propagate some reliability and/or device quality problems to
the end product. In order to ensure product quality, testing is done at various
levels during the design process and circuits failing the test are discarded. It
has been observed that after the chip is fabricated, the failure rate follows
the bathtub curve shown in Figure 1. During the infant mortality period or
the early life region, the failure rate is much higher compared to that during
the useful life region. This phenomenon occurs because in a large sample of
ICs produced, there are some units which have borderline manufacturing
defects, such that their life is relatively short compared to other units
of the same sample. These units are said to have infant mortality problems.
From a given sample, if all the units with such infant mortality problems are
sorted out, then the reliability of the remaining units of the sample will be
higher. Units with infant mortality problems often have latent defects which
are not noticed during normal testing of the VLSI circuits. Such defects can
be activated by stressing the circuit with higher current densities and
temperature. It has been observed in [7] that larger current densities can stress
defects such as epitaxial and crystal imperfections, metallization, oxide, and
junction anomalies. Many of these defects require localized current stress to
provide activation for the associated failure mechanism.

Figure 1 Bathtub curve for failure rate vs. time



1.1 Burn-in vs. Stress Testing 

Burn-in is a well-known technique that can provide the stressing of the circuit 
needed for weeding out circuits with infant mortality problems [6]. However, 
burn-in is a very expensive procedure. It requires special ovens to maintain 
a high temperature. Among the three known techniques of burn-in (static,
dynamic, and monitored), monitored burn-in is the most effective in detecting
reliability defects. It requires the application of test vectors during burn-in. The
number of available input/output lines in a burn-in oven increases the price
of burn-in ovens significantly, and hence, monitored burn-in can be fairly
expensive. Currently, there is no known low-cost alternative to burn-in.

In this paper we present a test methodology which can potentially weed out
some of the latent defects which are activated by higher current densities or
localized current stress while achieving a high fault coverage for the stuck-at
faults. Stress testing is a newly proposed method which involves running stress
test patterns on a circuit for an extended period of time, such that units with
infant mortality problems fail. The stress test patterns should be such that
they activate all parts of the circuit to create high current densities. We have
developed a method where we use an already existing test set and reorder it
to achieve our goals. If stress testing needs to be performed for a long period
of time, then the same test set can be repeated.
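
To make the idea of reordering an existing test set concrete, the sketch below greedily orders the vectors so that consecutive vectors differ in as many input bits as possible. This is only a crude proxy chosen for illustration: the method of this paper targets the switching activity at internal nodes in selected regions of the circuit, which requires a circuit model that is not reproduced here. The function names are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equally long vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def reorder_for_activity(vectors):
    """Greedy ordering: apply next the vector farthest from the current one."""
    if not vectors:
        return []
    remaining = list(vectors)
    order = [remaining.pop(0)]
    while remaining:
        # max() returns the first maximal element, so ties keep input order.
        nxt = max(remaining, key=lambda v: hamming(order[-1], v))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def total_toggles(order):
    """Input-bit transitions caused by applying the vectors in this order."""
    return sum(hamming(a, b) for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    test_set = ["0000", "0001", "1111", "1110"]
    reordered = reorder_for_activity(test_set)
    print(reordered, total_toggles(test_set), total_toggles(reordered))
```

Because the reordered list is a permutation of the original test set, the stuck-at fault coverage of a combinational circuit is unchanged; only the number of transitions between consecutive vectors grows.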

The stress test methodology can also find application in testing of dies for 
Multi-Chip Modules (MCMs). To achieve high yield for the modules, it is 
essential to use the Known Good Dies (KGD) [2]. Several researchers [3,4] 
have suggested that wafer-level and die-level stress testing should be used 
to achieve the KGDs. Hence, in the absence of die-level burn-in, the stress 
testing techniques can be used to determine the known good dies. The testing 
technique can also be used during monitored burn-in [5,6,7] to enhance the 
localized current stress in parts of the circuit. 



1.2 Current Increase in CMOS Circuits

For functional verification of a circuit, test vectors are generated targeting a 
particular fault model such as the stuck-at faults. Stuck-at faults have been 
traditionally used to model manufacturing defects [1]. To achieve high defect 
coverage, the test vectors should be such that 100% fault coverage is achieved 
for the list of faults under consideration. In our test methodology, we reorder 
the input stuck-at test vectors such that a high current density is achieved in 
a predefined portion of the circuit, while testing for stuck-at faults. In CMOS
circuits, the majority of the current flow is due to signal switching. Hence, by
reordering the stuck-at test vectors we try to increase the switching activity, 
and hence the current, in certain specified regions of the circuit. 

The estimation of average current due to switching activity (average
number of transitions per unit time) has been considered in [10,11,12], where the
switching activities at the internal nodes of a circuit are estimated using
probabilistic techniques, based on input signal probabilities (probability of having
a logic ONE) and switching activities. Experiments suggest that the average
switching current can be accurately estimated using the signal activity
measure at the internal nodes. Recently Chou et al. [8] considered scheduling of
tests so as to minimize power dissipation. They provide theoretical results for
the power-constrained test scheduling problem. The objective was to minimize
total test length subject to the power constraint. In this paper we present a
methodology to selectively control current flow in sections of the circuit while
testing for stuck-at faults.



2 PRELIMINARIES

In this section we describe a model for estimating the average current in 
CMOS circuits. The current drawn by a CMOS circuit is intimately related 
to the power dissipation. We will first estimate the average power dissipation 
of a CMOS multi-level combinational logic circuit due to signal activity. 
Power Dissipation in CMOS: The three sources of power dissipation in
CMOS circuits are leakage current, short-circuit current, and switching
current associated with charging or discharging of load capacitances. The
latter component accounts for the majority of the power dissipation in CMOS
and is the only one considered in our discussion. The capacitances internal to
a logic gate are assumed to be small and are neglected in our analysis. The
average power dissipation in a multi-level logic circuit is given by




$$P_{avg} = \frac{1}{2} V_{dd}^2 \sum_{i=1}^{n} A_i C_i \qquad (1)$$



where $V_{dd}$ is the supply voltage and is constant, $C_i$ is the capacitance associated
with node $i$ of the circuit, and $A_i$ is the activity at node $i$. $A_i$ is equal
to the average number of transitions per unit time. The summation is taken
over all the $n$ nodes of the circuit. From Equation 1 it is clear that the average
current $I_{dd}$ drawn from the supply voltage due to the switching component of
power is equal to $I_{dd} = P_{avg}/V_{dd} = \frac{1}{2} V_{dd} \sum_{i=1}^{n} A_i C_i$. Hence, both average power
dissipation and switching current are proportional to the weighted switching activity
given by $\sum_{i=1}^{n} A_i C_i$.
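
As a numerical illustration of Equation 1, the following sketch evaluates the weighted switching activity, the average power and the average supply current for a hypothetical three-node circuit. The activities, capacitances and supply voltage are invented values, and the factor of 1/2 assumes that each transition dissipates $C_i V_{dd}^2 / 2$.

```python
def weighted_activity(activities, capacitances):
    """Sum of A_i * C_i over all nodes (transitions/s times farads)."""
    return sum(a * c for a, c in zip(activities, capacitances))


def average_power(vdd, activities, capacitances):
    """Average switching power in watts, following Equation 1."""
    return 0.5 * vdd ** 2 * weighted_activity(activities, capacitances)


def average_current(vdd, activities, capacitances):
    """Average supply current in amperes: power divided by supply voltage."""
    return average_power(vdd, activities, capacitances) / vdd


# Hypothetical example values, not taken from the paper.
ACTIVITIES = [2.0e6, 1.0e6, 4.0e6]       # transitions per second at each node
CAPACITANCES = [10e-15, 20e-15, 5e-15]   # load capacitance in farads
VDD = 5.0                                # supply voltage in volts

if __name__ == "__main__":
    print("weighted activity:", weighted_activity(ACTIVITIES, CAPACITANCES))
    print("average power [W]:", average_power(VDD, ACTIVITIES, CAPACITANCES))
    print("average current [A]:", average_current(VDD, ACTIVITIES, CAPACITANCES))
```

Because $V_{dd}$ is constant, ranking test orderings by the weighted activity alone gives the same ranking as ranking them by power or by current.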

For a gate level description of the circuit, the load capacitance of each gate 
can be approximated by the fanout times the transistor input capacitance. 
However, estimation of signal activity at the nodes is not trivial. All published
methods of estimating signal activity involve estimation of signal
probability, which is the probability of a signal taking a logic value of ONE.
If the primary input signal probabilities and activities are known, probabilistic
or simulation-based techniques [10,11,12,13] can be used to estimate the
activities at internal nodes of a circuit.

In this paper, the exact input stimulus to the circuit under the test mode 
is known. Hence, a simulation-based technique can be used to obtain the 
actual activity at the internal nodes of a circuit. It should be observed that if 
accurate delay models for the logic gates and interconnects are available, then 
this technique will be able to accurately determine the spurious transitions at 
the internal nodes of the circuits. The spurious transitions cause switching of
internal nodes and hence dissipate power. The details of the power estimation
technique during input reordering are given in Section ??.

The maximum current is important when the circuit has to be stressed to
its limit. From Equation 1 we can observe that for a given set of stuck-at test
vectors $V$, the condition for maximum current is given by

$$\underset{v_k, v_l \in V}{\text{maximize}} \; \sum_{i=1}^{n} T_i^{v_k,v_l} C_i$$

where $T_i^{v_k,v_l}$ is given by

$$T_i^{v_k,v_l} = \begin{cases} 0 & \text{if } v_k \text{ followed by } v_l \text{ produces no transition at node } i \\ 1 & \text{if } v_k \text{ followed by } v_l \text{ produces a transition at node } i \end{cases}$$

$v_k$ and $v_l$ represent two stuck-at test vectors from set $V$. The term $T_i^{v_k,v_l}$
is similar to the activity defined in Equation 1. If one considers the nodes of a
circuit to have finite delays, then glitches can occur. In that case, the
range of values of $T_i^{v_k,v_l}$ can be larger than 1, depending on whether no transition
or a functional or hazardous (static or dynamic hazard) transition is present
at node $i$.

Thermal Modeling of ICs: Most of the electrical energy consumed by 

an IC eventually appears as heat. An increase in device temperature can be 
achieved by dissipating more heat through increased activity. The accelerated 
aging of an IC can be expressed as an exponential function of the junction 
temperature as:

$$t_a = t_0 \exp\left[\frac{E}{k}\left(\frac{1}{T_j} - \frac{1}{T_0}\right)\right]$$

where

$t_a$ = lifetime at elevated junction temperature $T_j$
$t_0$ = normal lifetime at normal junction temperature $T_0$
$E$ = activation energy (eV)
$k$ = Boltzmann's constant ($8.617 \times 10^{-5}$ eV/K)

For a certain process, the lifetime of an IC could be decreased by a factor
of 2 for every 10°C increase in temperature [18]. Units with infant mortality
problems have a low lifetime, which can be shortened further by increasing 
temperature through stress testing, thereby weeding them out with a reason- 
ably long stress testing time. 
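
To make the exponential dependence concrete, the sketch below evaluates the ratio $t_a / t_0$ from the expression above. The activation energy and the two junction temperatures are hypothetical values chosen for illustration, not data from the paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def lifetime_ratio(activation_energy_ev, t_normal_k, t_stress_k):
    """Return t_a / t_0 for junction temperatures given in kelvin."""
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_stress_k - 1.0 / t_normal_k
    )
    return math.exp(exponent)


# Hypothetical numbers: 0.7 eV activation energy, 55 C normal, 125 C under stress.
ratio = lifetime_ratio(0.7, 55 + 273.15, 125 + 273.15)

if __name__ == "__main__":
    print(f"lifetime under stress is {ratio:.4f} of the normal lifetime")
```

A ratio below 1 means the part ages faster at the elevated junction temperature, which is what stress testing relies on to expose weak units quickly.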
3 GRAPH-BASED FORMULATION OF THE PROBLEM

The problem of reordering test vectors to achieve the required current stress 
can be formulated as a graph traversal problem [14] for efficient solution. To 
better understand the procedure, let us consider the following definition. 



Definition 1 (Hamiltonian Path) A graph $G(V, E, W)$ consisting of a set
of nodes $V$, a set of edges $E$ and a set of weights $W$ (each edge associated with
a weight) is said to have a Hamiltonian path if there is an ordering of nodes
$\langle v_1, v_2, \ldots, v_m \rangle$ of $G$, where $m = |V|$, such that $\langle v_i, v_{i+1} \rangle \in E$ for all
$i$, $1 \le i < m$. The ordering $\langle v_1, v_2, \ldots, v_m \rangle$ represents a Hamiltonian path
of graph $G$. The cost of the Hamiltonian path is given by the summation of
weights associated with each edge of the Hamiltonian path.



Let us consider a set of stuck-at test vectors $V$ of size $m$ for a combinational
circuit $C$. In order to determine the switching activity due to the application
of the test vectors we consider a graph $G(V, E, W)$ whose nodes are given by
the set $V$. The cardinality of set $V$ is $m$, which is the number of test vectors to
be applied to circuit $C$ under test. $E$ is the set of edges in the graph, while $W$
represents the set of weights associated with the edges. For example, an edge
connecting nodes $v_i \in V$ and $v_j \in V$ is given by $e_{ij} \in E$ and represents that
test vector $v_j$ is applied to the circuit after application of vector $v_i$. Each edge
is also associated with a weight $w_{ij} \in W$. Let us also assume that the graph is
fully connected, i.e., there exists an edge between every pair of vertices $v_i$ and
$v_j$ ($i \ne j$). Such graphs are referred to as cliques. The weight $w_{ij}$ associated
with each edge is given by the switching activity over all circuit nodes due
to application of test vector $v_i$ followed by $v_j$. Hence, $w_{ij}$ is proportional to the
current associated with the application of input vectors $v_i$ and $v_j$, respectively,
and is given by






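$$w_{ij} = \sum_{k=1}^{n} T_k^{v_i,v_j} C_k \qquad (2)$$

where $T_k^{v_i,v_j}$ indicates whether node $k$ makes a transition when $v_i$ is followed
by $v_j$ (as defined in Section 2), and $C_k$ is the capacitive loading associated with
node $k$ of circuit $C$ having $n$ nodes. Assuming the transistor input capacitance to be constant
for all transistors of a design, $C_k$ can be approximated by the number of
fanouts at node $k$.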
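
The edge weights of Equation 2 can be obtained with a zero-delay logic simulation. The sketch below is only an illustration: the two-input circuit, its node list and its fanout counts are hypothetical stand-ins for a real netlist, and glitches are ignored.

```python
def simulate(vector):
    """Zero-delay evaluation of a small hypothetical circuit.

    Returns the logic value of every node: inputs a, b and gates n1, n2, out.
    """
    a, b = vector
    n1 = a & b
    n2 = a ^ b
    out = n1 | n2
    return (a, b, n1, n2, out)


# Hypothetical fanout count per node, used as the capacitance C_k.
FANOUT = (2, 2, 1, 1, 1)


def edge_weight(v_i, v_j):
    """w_ij: sum of C_k over the nodes that toggle when v_i is followed by v_j."""
    before, after = simulate(v_i), simulate(v_j)
    return sum(c for c, x, y in zip(FANOUT, before, after) if x != y)


def weight_matrix(vectors):
    """Full weight matrix; the diagonal is left undefined (None)."""
    m = len(vectors)
    return [
        [edge_weight(vectors[i], vectors[j]) if i != j else None for j in range(m)]
        for i in range(m)
    ]


VECTORS = [(0, 0), (0, 1), (1, 0), (1, 1)]

if __name__ == "__main__":
    for row in weight_matrix(VECTORS):
        print(row)
```

With a zero-delay model a node toggles in either direction or not at all, so the matrix is symmetric; a delay-aware simulator that counts glitches would generally break that symmetry.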

For graph $G(V, E, W)$, each node represents a test vector $v_k$ for the CMOS
circuit $C$, and each edge $e_{ij}$ represents the current flow or power dissipation
when vector $v_j$ is applied after the application of vector $v_i$. Hence, a
Hamiltonian path of the graph $G$ starting from node $v_1$ and ending in $v_m$,
$\langle v_1, v_2, v_3, \ldots, v_m \rangle$, represents an ordering of test vectors for which the average
current is proportional to

$$P = \sum_{i=1}^{m-1} w_{i,i+1}$$

Figure 3 Graph G for the 4 test vectors of the circuit in Figure 2







This suggests that the maximum or minimum length Hamiltonian path is an
ordering of test vectors which produces maximum or minimum average current,
respectively. Let us consider an example. Figure 3 shows the graph $G$
for the four test vectors applied to the 1-bit full adder of Figure 2. The weight
associated with each edge represents the actual simulation result of applying
two test vectors to the adder circuit. For example, $w_{12}$ represents the switching
activity when test vector 2 is applied after application of vector 1. The
capacitance associated with each node of the full adder is approximated by
the number of fanouts at that node. Glitches were neglected. From Figure 3 it
is clear that the maximum Hamiltonian path is $\langle v_1, v_2, v_3, v_4 \rangle$. Hence, such
a sequence of test vectors will produce the maximum average current, while a
minimum Hamiltonian path, represented by $\langle v_2, v_4, v_1, v_3 \rangle$, produces minimum
average current.

The problem of determining the maximum or a minimum Hamiltonian path,
or a path of length close to a specified value, is similar to the traveling salesman
problem, for which there exists no known polynomial time algorithm [14].
Hence, we resort to approximate methods of solution.



3.1 Reordering for Maximal or Minimal Activity 

The graph $G(V, E, W)$ can be represented by an $m \times m$ matrix $M$ such that
the $(i,j)$th entry $M_{ij}$ is equal to $w_{ij}$; $M_{ii}$ is undefined. We use a greedy
algorithm to determine the reordered test vectors for maximal current stress
(Algorithm 1). A very similar procedure can be used for minimizing the average
current.

Algorithm 1:

MaxActivity = 0;
From StartNode = 1 to m {
    Generate matrix M;
    Activity = 0;
    i = StartNode;
    Repeat m times: {
        Select largest M_ik from row i;
        Activity = Activity + M_ik;
        Delete row i and column k from M;
        i = k;
    }
    MaxActivity = Activity, if Activity > MaxActivity;
}

Starting from a row $r$ (test vector or node) of matrix $M$, the greedy algorithm
selects the largest (smallest) entry in that row. That takes time $O(m)$. The
row ($r$) and column ($c$) corresponding to the maximum (minimum) entry
are removed from matrix $M$. Note that $c$ represents the next node visited
from node $r$. Row $c$ is selected next for the above operation. The process is
repeated $m$ times to determine a Hamiltonian path starting at node StartNode.
The entire process is again repeated with a different starting node for the
Hamiltonian path. Hence, the algorithm takes $O(m^3)$ time.
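
The greedy procedure can be sketched in a few lines. This is an illustrative variant rather than the authors' C implementation: instead of deleting rows and columns it keeps a set of unvisited vectors (so a path can never return to its start node), and the 4 x 4 weight matrix is a hypothetical example.

```python
def greedy_path(weights, start, maximize=True):
    """Build one Hamiltonian path greedily from `start`.

    Returns the visiting order and the summed edge weight (m - 1 edges).
    """
    pick = max if maximize else min
    order, total, current = [start], 0, start
    unvisited = set(range(len(weights))) - {start}
    while unvisited:
        # Best edge out of the current node among vectors not yet applied.
        nxt = pick(sorted(unvisited), key=lambda k: weights[current][k])
        total += weights[current][nxt]
        unvisited.remove(nxt)
        order.append(nxt)
        current = nxt
    return order, total


def reorder(weights, maximize=True):
    """Try every start node and keep the best greedy path found."""
    best_order, best_total = None, None
    for start in range(len(weights)):
        order, total = greedy_path(weights, start, maximize)
        if best_total is None or (
            total > best_total if maximize else total < best_total
        ):
            best_order, best_total = order, total
    return best_order, best_total


# Hypothetical symmetric weight matrix for four test vectors.
W = [
    [None, 5, 2, 1],
    [5, None, 4, 3],
    [2, 4, None, 6],
    [1, 3, 6, None],
]

if __name__ == "__main__":
    print("max ordering:", reorder(W, maximize=True))
    print("min ordering:", reorder(W, maximize=False))
```

Each start node costs $O(m^2)$ work (m - 1 selections over at most m candidates), so trying all m start nodes gives the cubic bound stated above.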



3.2 Reordering for Desired Activity 

In order to achieve a desired circuit activity by test reordering, a procedure
similar to the one described above can be used. If the desired activity is $D$,
then we define the average activity per node, $D_a = D/m$, where $m$ is the
number of nodes of graph $G(V, E, W)$. Instead of selecting the largest $M_{ik}$
as shown in Algorithm 1, the value of $M_{ik}$ which gives the closest running
average to $D_a$ is chosen. This approach leads to a reordering such that the activity
is very close to the desired value. In case the desired value of activity is higher
or lower than the maximal or minimal value as obtained in Section 3.1, the
algorithm will return the maximal or the minimal value, respectively. A sketch
of the algorithm is given below.

Algorithm 2:

DesiredActivity = LargeNegative;
From StartNode = 1 to m {
    Generate matrix M;
    Activity = 0;
    i = StartNode;
    Repeat j = 1 to m times: {
        Select M_ik from row i such that:
            abs((Activity + M_ik)/j - Da) is the smallest;
        Activity = Activity + M_ik;
        Delete row i and column k from M;
        i = k;
    }
    DesiredActivity = Activity, if abs(Activity - D)
                                   < abs(DesiredActivity - D);
}


3.3 Reordering for Localized Current Stress 

Hnatek in [7] observed that defects such as epitaxial and crystal imperfections, 
metallization, oxide, or junctional anomalies may require localized current 
stresses for the activation of the failure mechanism. In this subsection we will 
consider reordering of stuck-at test vectors to achieve such a stress. 

Let us first consider the case when a chip is partitioned into two parts: $Part_1$
and $Part_2$. In order to have a large current gradient between $Part_1$ and $Part_2$,
the activity (weighted by capacitance) in one part should be maximized while
the activity (weighted by capacitance) in the other part should be minimized.
The activities weighted by the corresponding capacitances in the two parts
can be easily determined by applying a test vector $v_i$ followed by vector $v_j$
and by noting the number of nodes undergoing transition. Hence, one can
construct two completely connected graphs $G_1(V, E, W^1)$ and $G_2(V, E, W^2)$,
where the respective edges $e^1_{ij}$ and $e^2_{ij}$ between nodes $v_i$ and $v_j$ ($v_i, v_j \in V$)
are weighted by the activity from the corresponding parts. It should be noted
that the number of vertices in each graph $G_1$ and $G_2$ is equal to the number
of test vectors. The difference between the weights associated with edges $e^1_{ij}$
and $e^2_{ij}$ is equal to

$$\Delta_{ij} = w^1_{ij} - w^2_{ij}$$

which signifies the difference in activity between the two parts when two
test vectors $v_i$ and $v_j$ are applied in sequence. Now, it is possible to construct
a completely connected graph $G_s(V, E, W^s)$ such that each edge weight
$w^s_{ij} = \Delta_{ij}$. Using the algorithms described above, it is possible to find a Hamiltonian
path which is maximal, minimal, or of a specified length. The maximal
Hamiltonian path corresponds to the ordering of test vectors which produces
the maximal current gradient between $Part_1$ and $Part_2$, while the specified
weight produces the specified current gradient between the parts.

Let us consider the more general case when a circuit is partitioned into
$p$ parts. It is possible to generate graphs $G_1(V, E, W^1), \ldots, G_p(V, E, W^p)$ for
each of the parts such that the weight $w^k_{ij}$ associated with edge $e^k_{ij}$ for part
$Part_k$ is equal to the activity in the part due to the application of test vector
$v_i$ followed by $v_j$. Each part $k$ can be associated with a stress weight $s_k$ which
specifies the relative stress that the part is required to experience.
For example, in case of the two-way partitioning as described above, $s_1$ and
$s_2$ should be 1 and $-1$ respectively, to achieve maximum current gradient
between the parts. The edge weights in each graph $G_k(V, E, W^k)$ are multiplied
by $s_k$ to take the relative stress weighting factor into account. The new edge
weight $nw^k_{ij}$ is given by

$$nw^k_{ij} = w^k_{ij} \cdot s_k$$

A modified graph $G_s(V, E, W^s)$ is constructed such that the weight of each edge
is given by

$$w^s_{ij} = \sum_{k=1}^{p} nw^k_{ij}$$

The maximum Hamiltonian path in graph $G_s(V, E, W^s)$ produces the near-optimal
stress across the different parts.
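
The construction of the combined graph is a weighted sum of the per-partition matrices. The sketch below illustrates it with two hypothetical 3 x 3 partition matrices and stress weights 1 and -1; the function name and the numbers are assumptions made for the example.

```python
def combine_partition_graphs(partition_weights, stress_weights):
    """Return W^s with w^s_ij = sum over k of s_k * w^k_ij (diagonal undefined)."""
    m = len(partition_weights[0])
    combined = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                combined[i][j] = sum(
                    s * w[i][j] for s, w in zip(stress_weights, partition_weights)
                )
    return combined


# Hypothetical activity matrices for two partitions and three test vectors.
W_PART1 = [[None, 7, 2], [7, None, 5], [2, 5, None]]
W_PART2 = [[None, 1, 4], [1, None, 3], [4, 3, None]]

# Stress weights 1 and -1 reproduce the two-way difference graph.
W_S = combine_partition_graphs([W_PART1, W_PART2], [1, -1])

if __name__ == "__main__":
    for row in W_S:
        print(row)
```

The resulting matrix can be handed to any of the path heuristics of Sections 3.1 and 3.2; note that its entries may be negative, which those heuristics tolerate because they only compare edge weights.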

Circuit partitioning for stress testing can only be done after the chip layout. 
From the chip layout, the reliability engineer can determine the sections of the 
chip across which high current gradient is required. Based on such information, 
it is possible to determine the logic gates in each part. We have taken a layout
of a 4x4 multiplier and performed such stress testing by partitioning it into
three parts.

Figure 4 Test lengths for 100% test efficiency



4 IMPLEMENTATION AND RESULTS 

The algorithms described earlier have been implemented in C on SUN workstations.
The circuits used for experimentation are full-scan versions of a subset
of the ISCAS-89 [17] benchmark suite, which is used extensively in VLSI
CAD (Computer Aided Design) research for evaluating new techniques in
testing and synthesis. The circuits were run through an automatic stuck-at
test generator, TRAN [16], to get test patterns with 100% test efficiency.
These test patterns were the target for reordering such that the required current
stress could be achieved. Figure 4 shows the test lengths (or the number
of test vectors) of the circuits for 100% test efficiency.

Figure 5 graphically shows the comparison between minimum and maxi- 
mum current drawn by a circuit by reordering of the input patterns. It is 
interesting to note that current requirements during testing can be changed 
by more than 300% by reordering the test vectors. 

Table 1 shows the performance of our algorithm when a desired activity
is required during testing of a VLSI chip. The results based on Algorithm 2
show that it is possible to control the circuit activity within 5% of the desired
value.

In order to see the effect of reordering test vectors to create electrical stress 
across different partitions, we considered a multiplier circuit. Figure 6 shows 
the layout of a 4x4 multiplier. Test vectors were generated for the multiplier 
using TRAN [16] for 100% coverage of stuck-at faults. Table 2 shows the 
results of reordering the test vectors to create the maximum stress gradient 
between different partitions of the layout of Figure 6. The activity ratio refers 
to the ratio of activity in the two partitions under consideration. Results
on three different partitions have been presented. Results show that a large
activity gradient can be produced based on the stuck-at test vectors of the
circuit.

Figure 5 Minimum and maximum current drawn by circuits

Figure 6 A 4x4 multiplier layout

Table 1 Comparison of desired and actual power values

Design Name   Desired Activity   Actual Activity
s208                 800                861
                    1700               1699
                    2100               2082
s298                3000               3001
                    4000               3965
                    5000               4812
s510                5000               5099
                    7000               7000
                   10000               9951

Table 2 Results for the multiplier example

Partition   Activity ratio
Part1            10.69
Part2
Part3             5.12



5 DISCUSSIONS AND FUTURE DIRECTIONS 

In this paper we have presented a novel approach to reordering of stuck-at 
test vectors to create electrical stress during functional testing of the circuit. 
Such electrical stress can activate some latent defects and can weed out defects 
such as epitaxial and crystal imperfections, metallization, oxide, and junction 
anomalies. The technique can also be used to obtain the Known Good Dies for 
MCMs when die-level burn-in is not available. The methodology is simple and
utilizes existing test patterns for any good fault model, which makes it
a low-cost alternative to the well-established burn-in technique. The proposed
method should be investigated further to answer some practical questions.
What should be the duration of the stress test? What kinds of circuits are more
amenable to stress testing? Also, what range of improvement in reliability can
be expected by application of the stress test?

Abstract

In this paper, a non-Monte Carlo approach is presented for the simulation and sensi- 
tivity analysis of linear(ized) analog circuits under parameter variations. With uncer- 
tain parameters represented as intervals, we present a formulation of circuit equations 
as a set of interval linear equations. An elegant and efficient algorithm is developed 
based on some interval-mathematic results derived in this paper. In contrast to Monte 
Carlo simulation, the proposed approach is rapid — with the computational cost com- 
parable to one nominal circuit simulation — and correct — the computed bounds al- 
ways contain, and are usually close to, the actual bounds. Further, sensitivity under 
parameter variations can be computed from the response bounds with minor com- 
putation cost. The algorithms have been implemented into SPICE3F5. Experimental 
results are promising. 



1 INTRODUCTION 

Recent developments in microelectronic manufacturing technologies and system de- 
sign methodologies have created a new set of challenges to circuit simulation. This 
has been exemplified in three aspects: First, the complexity of microelectronic sys- 
tems being designed is increasing constantly in the number of components being 
integrated. As a result, simulation and verification are taking an enormous amount of
computation time. Second, the types of components found in a chip (system-on-a- 
chip) are becoming radically diversified, e.g., electronic devices, as well as optical 
devices, micro-electro-mechanical (MEMS) devices. As a consequence, circuit pa- 
rameters are more subject to manufacturing process variations. For example, em- 
bedded passive components (resistors and capacitors) in mixed-signal multi-chip 
modules have a typical variation of ±10% [2]. Parameter uncertainty arises also 
in multi-abstraction/mixed-mode simulation. Higher level abstraction usually represents
some uncertainty to lower level simulation. For example, unknown states in
the digital portion give rise to parameter uncertainty in the analog portion. Finally,
time-to-market pressure and life-cycle costs motivate more use of simulation as a tool not
only in the design process, but also in the test development stage. In this paper, we
consider how to efficiently simulate electronic circuits under parameter variations.

Simulation under parameter variations was studied early in the area of design for 
manufacturing. Since only extreme conditions are considered, it is known as worst- 
case tolerance analysis [8, 11, 12]. Existing approaches can be classified into three 
categories: vertex enumeration, sensitivity-based estimation, and Monte Carlo simulation.
Vertex enumeration simulates all the possible extreme parameter cases.
However, extreme values of an output response may not necessarily occur at the vertices
of parameter space. Even if the extreme value occurs at the vertices, as is
the case for resistive networks [1], finding such an extreme value is NP-complete, as
proved by Huang and Bryant in the context of switch-level simulation [7].

The sensitivity-based approach calculates the worst-case response via sensitivity
estimation. Suppose that $f$ is the function, and $x_i$ is a statistical parameter. Then
the upper and lower bounds of $f$ under parameter variations can be estimated as

$$f^{U,L} = f_0 \pm \sum_i \left| \frac{\partial f}{\partial x_i} \right| \Delta x_i$$

where $f_0$ is the nominal value of $f$ and $\Delta x_i$ is the tolerance of $x_i$. This method is not
computationally expensive. However, it is not predictable whether the results
underestimate or overestimate the actual bounds.

With the Monte Carlo technique, simulation is repeated for random combinations 
of values chosen from within the range of each parameter. Unfortunately, determin- 
ing accurate bounds on the behavior of a circuit requires a large number of simu- 
lations to be effective. Therefore, its use is limited to small circuits that have a few 
statistical parameters. In addition, the Monte Carlo method always underestimates 
the response bounds. 

In this paper, we present a rapid, correct and conservative approach to frequency- 
domain simulation and sensitivity analysis of linear(ized) analog circuits and sys- 
tems under parameter tolerances. A large class of circuits and systems widely used 
in video and image processing, digital signal processing, control, communications, 
and many other applications falls into this category. The proposed method is based on 
interval mathematics. We note that the use of interval analysis in circuit simulation
is not a new idea. However, as pointed out by Zukowski [14], “the field of interval
arithmetic, which has often addressed very general problems, has developed a poor
reputation”. Previous attempts have not produced satisfactory results [8, 11]. Our approach
is based on two techniques described in this paper. First, a novel formulation
of circuit equations, called generalized nodal analysis, is developed. Second, we de- 
rive a theoretical characterization of interval linear systems. Based on the theory, an 
elegant and efficient algorithm is developed. Our work is inspired by the recent work 
of Hansen [4]. 

This paper details our theory and methods, and presents a prototype analog circuit 
simulator utilizing the proposed techniques. An interval-based theoretical framework
for handling parameter variations is introduced in Section 2. Under this framework,
circuit equations are formulated by generalized modified nodal analysis as a set of
interval linear equations in Section 3. A theory is developed in Section 4, which
leads to an elegant and efficient algorithm described in Section 5. Sensitivities under
process variations are obtained as a by-product with virtually no extra computation
cost, as shown in Section 6. Implementation and experimental results are described
in Section 7. Section 8 concludes the paper.



2 NOTATIONS FROM INTERVAL MATHEMATICS 

In this section we present some concepts and notations for later use from interval 
mathematics. 

Let $p \in \mathbb{R}$ be a real number whose value may not be precisely known. Instead, we
are often given a range and $p$ is uncertain within this range. This can be represented
by an interval number $p^I$, with lower (left) bound $p^L$ and upper (right) bound $p^R$,
denoted by $p^I = [p^L, p^R]$. The midpoint $mid(p^I)$ of an interval $p^I$ is defined
as $mid(p^I) = \frac{1}{2}(p^L + p^R)$, and the radius $rad(p^I)$ of $p^I$ is defined as
$rad(p^I) = \frac{1}{2}(p^R - p^L)$.

An interval vector $x^I$ is a vector whose elements are interval numbers, and we
write an interval vector as $x^I = [x^L, x^R]$. An interval matrix $A^I$ is a matrix whose
elements are interval numbers. We write an interval matrix as $A^I = [A^L, A^R]$. The
midpoint and radius of an interval vector and matrix are defined similarly to those of an
interval number.

We say an interval number $p^I$ contains $q^I$, denoted as $q^I \subseteq p^I$, iff $p^L \le q^L \le q^R \le p^R$.
An interval vector $x^I$ or matrix $A^I$ contains interval vector $y^I$ or matrix $B^I$
if every element of $x^I$ or $A^I$ contains the corresponding element of $y^I$ or $B^I$.

An interval linear equation, $A^I x = b^I$, is defined as the family of linear
equations:

$$\{Ax = b \mid A \in A^I \text{ and } b \in b^I\}.$$

We are interested in the solution set, given by:

$$S = \{x = A^{-1}b \mid A \in A^I \text{ and } b \in b^I\}.$$

In general, the solution set $S$ is a nonempty subset of $\mathbb{R}^n$ that need not be convex.
The solution hull is defined as the tightest interval enclosing $S$, represented as:

$$(A^I)^H b^I = \Box\{x = A^{-1}b \mid A \in A^I \text{ and } b \in b^I\}.$$

Here $\Box$ takes the hull of a non-empty set, and $(A^I)^H b^I$ denotes the result of
mapping $(A^I)^H$ applied to $b^I$. The projection of a solution hull into any particular
dimension is an interval. With a slight abuse of notation, we use $x^I$ to denote the
solution hull, and refer to $x_i^L$ and $x_i^R$ as solution bounds.
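
The basic notions of this section (bounds, midpoint, radius and containment) map directly onto a small data type. The class below is a minimal sketch for illustration; the names `Interval` and `from_tolerance` are assumptions of this example and do not come from the paper or from SPICE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] representing an uncertain real parameter."""

    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    @property
    def mid(self):
        """Midpoint: (lo + hi) / 2."""
        return 0.5 * (self.lo + self.hi)

    @property
    def rad(self):
        """Radius: (hi - lo) / 2."""
        return 0.5 * (self.hi - self.lo)

    def contains(self, other):
        """True if `other` lies entirely inside this interval."""
        return self.lo <= other.lo and other.hi <= self.hi

    @classmethod
    def from_tolerance(cls, nominal, percent):
        """Build an interval from a nominal value and a +/- percentage."""
        delta = abs(nominal) * percent / 100.0
        return cls(nominal - delta, nominal + delta)


if __name__ == "__main__":
    r = Interval.from_tolerance(1000.0, 10)  # a 1 kOhm resistor with 10% tolerance
    print(r, r.mid, r.rad, r.contains(Interval(950.0, 1050.0)))
```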
3 GENERALIZED MNA FORMULATION OF CIRCUIT EQUATIONS

One well-known peculiar factor of interval mathematics is that the order of opera- 
tions affects the solution quality significantly. In this section, we present a specific 
formulation of circuit equations such that its later solution yields tight bounds. 
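
The sensitivity to the order of operations can be seen on a one-line function. The sketch below uses naive interval arithmetic on plain tuples (a simplification chosen for this example) to evaluate $x(1 - x)$ and the algebraically identical $x - x \cdot x$ for $x \in [0, 1]$; the true range of both expressions is $[0, 0.25]$.

```python
def sub(a, b):
    """Interval subtraction: [a_lo - b_hi, a_hi - b_lo]."""
    return (a[0] - b[1], a[1] - b[0])


def mul(a, b):
    """Interval multiplication: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


x = (0.0, 1.0)
one = (1.0, 1.0)

factored = mul(x, sub(one, x))   # x * (1 - x)
expanded = sub(x, mul(x, x))     # x - x * x

if __name__ == "__main__":
    print("x * (1 - x) ->", factored)
    print("x - x * x   ->", expanded)
```

Both results enclose the true range, but the expanded form is twice as wide because each occurrence of `x` is treated as an independent quantity. This is why the way the circuit equations are written down matters for the tightness of the final bounds.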

Consider a linear time-invariant circuit, where some circuit parameters are un- 
certain, and represented by interval numbers. The Modified Nodal Analysis (MNA) 
method [6] is adopted with slight modification to formulate the circuit equations. 
Basically, for all the components which do not have interval parameters, the rules 
of MNA are followed. For every component with an interval parameter, an additional
variable (branch current) and an additional equation (branch equation) are introduced.
This is called the generalized MNA formulation, or simply Generalized Nodal
Analysis (GNA). In general, the GNA formulation results in a set of interval complex
linear equations, represented as:

$$T^I x = w^I. \qquad (1)$$

Interval complex equations in Eq. (1) can be rewritten as a set of interval real
equations as follows:

$$\begin{bmatrix} T^I_{\Re} & -T^I_{\Im} \\ T^I_{\Im} & T^I_{\Re} \end{bmatrix} \begin{bmatrix} x_{\Re} \\ x_{\Im} \end{bmatrix} = \begin{bmatrix} w^I_{\Re} \\ w^I_{\Im} \end{bmatrix} \qquad (2)$$

where subscripts $\Re$ and $\Im$ denote, respectively, the real part and the imaginary part of
a complex matrix, a complex vector, or a complex number. Therefore, under our formulation,
frequency-domain circuit simulation under parameter variations amounts
to solving Eq. (2), a set of interval linear equations, for a given set of frequency
points.
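
The block structure of Eq. (2) can be checked on ordinary (non-interval) numbers. The sketch below builds the real-equivalent system of a small complex system; the function name and the example values are hypothetical, and the interval case would apply the same layout to the lower and upper bounds, swapping them where a block is negated.

```python
def real_equivalent(T, w):
    """Return (A, b) so that A [Re x; Im x] = b is equivalent to T x = w."""
    n = len(T)
    A = [[0.0] * (2 * n) for _ in range(2 * n)]
    for p in range(n):
        for q in range(n):
            t = complex(T[p][q])
            A[p][q] = t.real            # upper-left block: real part
            A[p][q + n] = -t.imag       # upper-right block: minus imaginary part
            A[p + n][q] = t.imag        # lower-left block: imaginary part
            A[p + n][q + n] = t.real    # lower-right block: real part
    b = [complex(v).real for v in w] + [complex(v).imag for v in w]
    return A, b


# Hypothetical 2 x 2 complex system with a known solution x.
T_EXAMPLE = [[1 + 2j, 0.5j], [3, 2 - 1j]]
X_EXAMPLE = [1 - 1j, 2 + 0.5j]
W_EXAMPLE = [sum(T_EXAMPLE[p][q] * X_EXAMPLE[q] for q in range(2)) for p in range(2)]

if __name__ == "__main__":
    A, b = real_equivalent(T_EXAMPLE, W_EXAMPLE)
    for row, rhs in zip(A, b):
        print(row, "|", rhs)
```

Each complex entry contributes to the real system in a fixed pattern, which is the property Section 6 relies on when it differentiates with respect to a single parameter.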



4 PRECONDITIONING OF INTERVAL LINEAR SYSTEMS AND ITS 
CHARACTERIZATION 

In this section, we present a formalization of some original ideas described by Hansen [4] 
for solving the system of interval linear equations in the form of 

$$A^I x = b^I. \qquad (3)$$

Let $C$ be the midpoint of interval matrix $A^I$, i.e., $C = \frac{1}{2}(A^L + A^R)$. Pre-multiplying
Eq. (3) with $C^{-1}$ transforms the original system into a new system:

$$M^I x = r^I \qquad (4)$$




544 Part Twelve Testing in Complex Mixed Analog and Digital Systems 



where $M^I = C^{-1}A^I$ and $r^I = C^{-1}b^I$. We refer to this transformation as preconditioning,
and call Eq. (4) the preconditioned system.
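
As an illustration of preconditioning, the sketch below handles the 2 x 2 case in closed form. It uses the fact that multiplying an interval matrix with midpoint $C$ and radius $\Delta$ by the point matrix $C^{-1}$ yields midpoint $I$ and radius $|C^{-1}|\Delta$. The function names and the numerical values are assumptions made for the example.

```python
def inverse_2x2(c):
    """Closed-form inverse of a 2 x 2 point matrix."""
    (a, b), (d, e) = c
    det = a * e - b * d
    if det == 0:
        raise ValueError("midpoint matrix is singular")
    return [[e / det, -b / det], [-d / det, a / det]]


def precondition(a_lo, a_hi, b_lo, b_hi):
    """Return (m_lo, m_hi, r_lo, r_hi) for M^I = C^-1 A^I and r^I = C^-1 b^I."""
    n = 2
    c = [[0.5 * (a_lo[i][j] + a_hi[i][j]) for j in range(n)] for i in range(n)]
    delta = [[0.5 * (a_hi[i][j] - a_lo[i][j]) for j in range(n)] for i in range(n)]
    c_inv = inverse_2x2(c)
    # Radius of M^I is |C^-1| * rad(A^I); its midpoint is the identity.
    m_rad = [[sum(abs(c_inv[i][k]) * delta[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    m_lo = [[eye[i][j] - m_rad[i][j] for j in range(n)] for i in range(n)]
    m_hi = [[eye[i][j] + m_rad[i][j] for j in range(n)] for i in range(n)]
    b_mid = [0.5 * (lo + hi) for lo, hi in zip(b_lo, b_hi)]
    b_rad = [0.5 * (hi - lo) for lo, hi in zip(b_lo, b_hi)]
    r_mid = [sum(c_inv[i][k] * b_mid[k] for k in range(n)) for i in range(n)]
    r_rad = [sum(abs(c_inv[i][k]) * b_rad[k] for k in range(n)) for i in range(n)]
    r_lo = [m - d for m, d in zip(r_mid, r_rad)]
    r_hi = [m + d for m, d in zip(r_mid, r_rad)]
    return m_lo, m_hi, r_lo, r_hi


# Hypothetical interval system: every entry has radius 0.1.
A_LO = [[1.9, 0.9], [0.9, 2.9]]
A_HI = [[2.1, 1.1], [1.1, 3.1]]
B_LO, B_HI = [0.9, 1.9], [1.1, 2.1]

if __name__ == "__main__":
    for part in precondition(A_LO, A_HI, B_LO, B_HI):
        print(part)
```

The output matrix is symmetric about the identity: its off-diagonal bounds are negatives of each other and each pair of diagonal bounds sums to 2.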

An assumption used in this paper is as follows:

Assumption 1 The preconditioned interval matrix $M^I$ is regular, i.e., every component
matrix $M$, $M \in M^I$, is nonsingular.

The original matrix $A^I$ is said to be strongly regular if Assumption 1 is satisfied.

Proposition 1 The solution hull of the preconditioned system contains that of the
original system, i.e., $(A^I)^H b^I \subseteq (M^I)^H r^I$.

Proposition 2 The solution hull of the preconditioned system contains the right
hand side vector, i.e., $r^I \subseteq (M^I)^H r^I$.

Proposition 3 The preconditioned matrix is centered around the identity matrix
$I$, i.e., $M^R_{ij} = -M^L_{ij} \ge 0$ for $i \ne j$ and $M^L_{ii} + M^R_{ii} = 2$.

Since $M^I$ does not contain a singular matrix (Assumption 1), and is centered
around the identity matrix (Proposition 3), it follows:

Proposition 4 The preconditioned matrix satisfies the following properties:

$$0 < M^L_{ii} \le 1, \qquad 1 \le M^R_{ii} < 2 \qquad (5)$$

Theorem 1 The solution set of the preconditioned system (4) is defined by the following
set of inequalities:

$$M^L|x| \le Y\,mid(r^I) + rad(r^I) \qquad (6)$$

$$M^R|x| \ge Y\,mid(r^I) - rad(r^I) \qquad (7)$$

where $Y$ is a diagonal matrix such that

$$Y_{ij} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \text{ and } x_i \ge 0 \\ -1 & \text{if } i = j \text{ and } x_i < 0 \end{cases}$$

Theorem 2 For the preconditioned system (4), the maximum and minimum non-zero
values of $|x_i|$ are determined respectively by the following two sets of equations:

$$M^L|x| = |mid(r^I)| + rad(r^I) \qquad (8)$$

$$(M^L - 2e_ie_i^T)|x| = (I - 2e_ie_i^T)|mid(r^I)| + rad(r^I) \qquad (9)$$

where $e_i$ is the $i$th unit vector whose $i$th component is 1 and all other components
are 0, and superscript $T$ denotes the transpose.

5 AN ALGORITHM FOR SOLVING PRECONDITIONED INTERVAL 
LINEAR SYSTEMS 

In this section, we describe an algorithm for solving a set of preconditioned interval 
linear equations, based on the theory developed in the last section. 

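For notational convenience, let us define a vector $s \in \mathbb{R}^n$ such that

$$s_i = |mid(r_i^I)| + rad(r_i^I),$$

and $t^{(i)} \in \mathbb{R}^n$ such that

$$t_j^{(i)} = \begin{cases} rad(r_j^I) - |mid(r_j^I)| & \text{if } j = i \\ s_j & \text{otherwise} \end{cases}$$

Let us define $P = (M^L)^{-1}$ and

$$f_i = \sum_{j=1}^{n} P_{ij} s_j.$$

Then from Theorem 2, the maximum value of $|x_i|$ equals $f_i$. Now we consider how
to compute the minimum non-zero value of $|x_i|$. According to Householder's inverse
matrix formula [5], we have:

$$(M^L - 2e_ie_i^T)^{-1} = P - 2c_iPe_ie_i^TP,$$

where $c_i = 1/(2P_{ii} - 1)$. Hence, the minimum value of $|x_i|$ is equal to

$$\begin{aligned} e_i^T(M^L - 2e_ie_i^T)^{-1}t^{(i)} &= e_i^T(P - 2c_iPe_ie_i^TP)\,t^{(i)} \\ &= e_i^T(I - 2c_iPe_ie_i^T)\,Pt^{(i)} \\ &= (1 - 2c_iP_{ii})\sum_{j=1}^{n} P_{ij}t_j^{(i)} \\ &= -c_ig_i, \end{aligned}$$

where $g_i = \sum_{j=1}^{n} P_{ij}t_j^{(i)} = f_i - 2P_{ii}|mid(r_i^I)|$.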

Now we are ready to present the complete algorithm for computing the solution 
hull of the preconditioned system. The algorithm is summarized in pseudo-code in 
Fig. 1. 







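Preconditioned.System.Solve($M^I x^I = r^I$)

1   $P \leftarrow (M^L)^{-1}$
2   for $i = 1$ to $n$ do
3       $s_i \leftarrow |mid(r_i^I)| + rad(r_i^I)$
4   for $i = 1$ to $n$ do
5       $f_i \leftarrow \sum_{j=1}^{n} P_{ij}s_j$
6       $g_i \leftarrow f_i - 2P_{ii}|mid(r_i^I)|$
7       if $g_i > 0$
8           if $mid(r_i^I) > 0$
9               $x_i^I \leftarrow [-g_i, f_i]$
10          else
11              $x_i^I \leftarrow [-f_i, g_i]$
12      else
13          if $mid(r_i^I) > 0$
14              $x_i^I \leftarrow [-g_i/(2P_{ii} - 1), f_i]$
15          else
16              $x_i^I \leftarrow [-f_i, g_i/(2P_{ii} - 1)]$
17  return $x^I$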


Figure 1 An algorithm for solving preconditioned interval linear systems. 


Theorem 3 Algorithm Preconditioned.System.Solve computes the exact solution
hull of the preconditioned system (4) with time complexity $O(n^3)$.
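
The algorithm translates almost line by line into code. The sketch below is an illustrative pure-Python version, not the SPICE3F5 implementation: it takes only the lower bound matrix $M^L$ (the upper bound is implied by the centering around the identity), inverts it with a dense Gauss-Jordan routine, and returns one `(lower, upper)` pair per unknown. All function names are assumptions of this example.

```python
def invert(a):
    """Dense Gauss-Jordan inverse with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * u for v, u in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def preconditioned_system_solve(m_lo, r_lo, r_hi):
    """Hull of M^I x = r^I for M^I centered at the identity (lower bound m_lo)."""
    n = len(m_lo)
    P = invert(m_lo)
    mid = [0.5 * (lo + hi) for lo, hi in zip(r_lo, r_hi)]
    rad = [0.5 * (hi - lo) for lo, hi in zip(r_lo, r_hi)]
    s = [abs(c) + d for c, d in zip(mid, rad)]
    hull = []
    for i in range(n):
        f = sum(P[i][j] * s[j] for j in range(n))   # largest |x_i|
        g = f - 2.0 * P[i][i] * abs(mid[i])
        # If g > 0 the interval straddles zero; otherwise rescale by c_i.
        inner = g if g > 0 else g / (2.0 * P[i][i] - 1.0)
        hull.append((-inner, f) if mid[i] > 0 else (-f, inner))
    return hull


if __name__ == "__main__":
    # Hypothetical preconditioned 2 x 2 system.
    M_LO = [[0.92, -0.08], [-0.06, 0.94]]
    print(preconditioned_system_solve(M_LO, [0.1, 0.9], [0.3, 1.1]))
```

For a scalar system the result can be checked by hand: with $M^I = [0.9, 1.1]$ and $r^I = [1.9, 2.1]$ the hull is $[1.9/1.1, 2.1/0.9]$, which is what the routine returns.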



6 SENSITIVITY COMPUTATION UNDER PARAMETER 
VARIATIONS 

Sensitivity is an important tool in many contexts of circuit design and test [10, 12]. 
We describe here a method to compute sensitivity under parameter variations with 
minor computational cost. 

Consider the circuit equation set shown in Eq. (1). In order to evaluate the sensi- 
tivities of all components of the vector to a single parameter pf , we differentiate 
Eq. (1) with respect to pf to obtain: 

dpi dpi dpi ■ 

Because vector does not depend on pi, the equation set above can be rewritten as 




dT^ 

dpi 



X 



I 



Note that in the GNA formulation, every uncertain parameter appears only in one 
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matrix entry. Suppose matrix entry is for component pf , then we have: 



dpi dpi dtig 



Note that the complex equation set Eq. (1) is converted to a real equation set Eq. (2)
for solution. Each matrix entry in Eq. (1) leads to two matrix entries in Eq. (2).
Suppose that for $t_{pq}$, the two corresponding matrix entries in Eq. (2) are $a_{kl}$ and
$a_{mn}$; then the above equation set can be transformed into a real equation set as



A' 



dx^ 

dpi 




(11)



where $\partial t_{pq}/\partial p_j$ is very easy to obtain from our GNA formulation method and:



bi 



xj if 2 = fc and pj is real number 
if 2 = m and pj is real number 
< —xj if 2 = fc and pj is imaginary number 
-xj^ if 2 = m and pj is imaginary number 
0 otherwise. 



Now sensitivity computation under parameter variations amounts to solving the
interval linear equation set Eq. (11). Note that Eq. (11) has the same form as Eq. (3):
only the right-hand side is different. Since Eq. (3) has been solved by algorithm
Preconditioned-System-Solve in Section 5, the solution vector is known. Thus the
right-hand-side vector can be easily generated. Since the inverse matrix P is already
available, the solution of Eq. (11) involves executing only Lines 2-16 of the algorithm,
which takes time linear in the number of non-zero entries in P.



7 EXPERIMENTAL RESULTS 

The proposed algorithms have been implemented in a computer program, SIVA-AC,
using the SPICE3F5 sparse matrix package. A number of experiments have been
conducted mainly to test the quality of computed interval bounds and validity of 
sensitivity computation. In this section, we describe two examples. All CPU data are 
collected on a SPARC-Ultra 1 workstation. 



7.1 A Two-Amplifier Active Network 

An active network shown in Fig. 2 is used to compare the quality of response bounds
produced by the proposed interval method with those produced by the existing Monte
Carlo method, as well as by the sensitivity-based method. The nominal parameter
values are given as

$R_1 = 1$, $R_2 = 1$, $C_1 = 1$, $C_2 = 1$, $E_1 = 1.3784$, $E_2 = 1.3784$.

All the parameters are assigned ±1% statistical variations.




Figure 2 Circuit schematic of a two-amplifier active network. 
Figure 3 Real part of $V(5)$.

Figure 3 plots the real part of $V(5)$ computed by the various methods.
We observe that the Monte Carlo simulation (10"^ trials) always under-estimates the
accurate bounds (approximated by 10^ trials Monte Carlo simulation). Our interval
method always over-estimates the accurate bounds. The nominal-sensitivity-based
method may over-estimate the actual bounds in some frequency ranges (e.g., lower
bounds between 0.145 Hz and 0.155 Hz), and may under-estimate them in some other
frequency ranges (e.g., upper bounds between O.lbbhz - O.l&hz). Our interval method
is superior to the others, since the computed bands always contain the actual bands
and, further, the differences are usually small.
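
The under-estimation by Monte Carlo sampling can be illustrated with a toy example that is unrelated to the circuit of Fig. 2. The response `y = 1/(R*C)`, the nominal values, the number of trials and the function names are all assumptions chosen for illustration. Because this response is monotonic in each parameter, the exact bounds sit at the corners of the parameter box, and a finite random sample can only fall inside them.

```python
import random


def exact_bounds(r_nom, c_nom, tol):
    """Exact range of y = 1/(R*C) when R and C each vary by +/- tol (relative)."""
    lo = 1.0 / ((r_nom * (1 + tol)) * (c_nom * (1 + tol)))
    hi = 1.0 / ((r_nom * (1 - tol)) * (c_nom * (1 - tol)))
    return lo, hi


def monte_carlo_bounds(r_nom, c_nom, tol, trials, seed=0):
    """Smallest and largest response seen over `trials` uniform random samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        r = r_nom * (1 + rng.uniform(-tol, tol))
        c = c_nom * (1 + rng.uniform(-tol, tol))
        samples.append(1.0 / (r * c))
    return min(samples), max(samples)


true_lo, true_hi = exact_bounds(1.0, 1.0, 0.01)
mc_lo, mc_hi = monte_carlo_bounds(1.0, 1.0, 0.01, trials=1000)
# The sampled band lies inside the exact band.
print(true_lo <= mc_lo, mc_hi <= true_hi)
```

An interval method errs in the opposite direction: its band encloses the exact one, which is the safe side when the bounds are used for test decisions.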
7.2 μA741 Operational Amplifier

We further test our program on the more complex μA741 operational amplifier,
shown in Fig. 4. The nominal values of circuit parameters are

$R_1 = 1k$, $R_2 = 50k$, $R_3 = 1k$, $R_4 = 3k$, $R_5 = 39k$, $R_6 = 50$,
$R_7 = 25$, $R_8 = 100$, $R_9 = 50k$, $R_{10} = 40k$, $R_{11} = 50k$, $C_{COMP} = 30pF$.

All the parameters are assigned ±5% statistical variations.




Figure 4 μA741 operational amplifier circuit.

Figures 5(a) and 5(b) plot the output bands of $V(OUT)$ under parameter variations
computed by our method, as well as by Monte Carlo. For 80 frequency points,
SIVA-AC took 68.2 seconds of CPU time, while Monte Carlo (10000 trials) took 12000
seconds. Figures 6(a) and 6(b) show the sensitivity of $V(OUT)$ with respect to $C_{COMP}$
under parameter variations computed by our method, as well as the nominal
sensitivity computed by SPICE3F5. We observe that the computed sensitivity bands
contain the curves of nominal sensitivity and, further, that the shapes of the sensitivity
bands match those of the nominal sensitivity. For 80 frequency points, SIVA-AC took
22.7 seconds, while SPICE nominal simulation took 24.5 seconds.



Figure 5 Simulation results under parameter variations: (a) magnitude of V(OUT); (b) phase of V(OUT).

Figure 6 Sensitivity computation results under parameter variations: (a) real part of V(OUT); (b) imaginary part of V(OUT).



8 CONCLUDING REMARKS

An interval-based approach was presented for frequency-domain simulation and
sensitivity analysis of linear(ized) analog integrated circuits and systems under
parameter variations. Experimental results have demonstrated that the proposed
method is extremely fast, and usually produces bounds very close to the actual bounds.
We attribute this to both a careful formulation of circuit equations and an elegant
algorithm developed on a series of theoretical characterizations derived in this paper.

We note that what we addressed in this paper is circuit simulation under parameter
variations. An application of this technique to analog fault simulation appears
in [13]. The statistical variations are completely ignored in our framework. As a
consequence, this method is not applicable to the situations where statistical distribution
is significant such as yield estimation [3].

Abstract

The evolution of microelectronic design techniques implies an important
methodological effort to raise the degree of abstraction away from technology and
devices. This process is not uniform and depends on the specific electronic
domain. For sensors and actuators, there have been only a few approaches
toward such global system design methodologies, most of them in recent
years. The main reasons for this relative delay are: non-standard technological
processes, especially in the case of embedded microsystems; the different types of
analysis required, which use a different environment for each magnitude (mechanics,
flow, magnetics, ...); and the connection between circuit simulators and structural 3D
simulators, whose representations need to be abstracted by introducing the concept
of sensor cell libraries in order to estimate performance in the global system (large
areas, signal sensitivities, mechanical restrictions, ...). In this paper, we present the
choices we made on these points for a new fabrication process for embedded
microsystems, developed together with the tool and design methodology and based
on the 1.0 μm CMOS process from ATMEL-ES2. Some demonstrators
are presented, together with measurement results.

1 INTRODUCTION

Sensors and actuators have been driving the evolution of electronics, since they are
present in most industrial products. This large market has undergone an
impressive evolution in recent years due to the requirements of new applications
such as car safety, water saving and water pollution, space applications, home
systems, etc. Often, these sensors are not interfaced with simple electronics, but
with complex systems through specific buses. Also, area and weight need to be
reduced for many applications.

All these requirements have directed the evolution of microsystems technology
in recent years, Baltes (1993), with different alternatives: monolithic
implementations, where sensor and electronics share the same silicon substrate;
multi-chip modules (MCM), which contain several specialized chips for sensing and
processing; and “classical” PCB-based systems.

From the point of view of a system designer, the ideal tool has to offer the
possibility of performing a given analysis of the system modeled at a very high level,
while keeping the main constraints of the implementation (price, speed and power
consumption). In order to minimize cost functions for complex systems, the
designer has to work even at low levels of abstraction, down to full-custom level.

In the case of microsystems design, Huijsing (1994), different types of analyses
have to be carried out to ensure stability (e.g. AC analysis) and performance (e.g.
DC analysis) of sensors. Most of that work is done by structural 3D
simulators.

Interactions between the different subsystems (sensor, analog and digital) have to be
taken into account in the design process to produce the whole system. The
system-level complexity of the digital parts drives the global verification
process, usually through transient simulation. This analysis has to be accurate
enough to cope with small sensor magnitudes and dependencies, which usually leads
to mixed-mode multi-level simulation; thus, higher levels of description for sensors
have to be available to system designers.

In this paper, we present our developments toward a new approach for monolithic
microsystems integration, addressing the technology and design sides concurrently.
In Section 2, different sensor types are described, which correspond to
different technological developments (related to different process steps). In Section
3, the design methodology is presented, following the steps of a “standard”
bottom-up design methodology: full-custom, sensor cell libraries and semi-custom.
Section 4 shows some circuits and the measured results for devices and
microsystems, including a demonstrator designed for the prototype washing
machine from Fagor Sensores (COPRECI).

2 SENSOR TYPES

The standard process used is the 1.0 μm N-well CMOS process from ATMEL-ES2, with one
polysilicon and two metal layers. A post-processing technology has been designed
at CNM to allow the fabrication of devices based on micromachined membranes with a
thickness of up to 10 μm. Membranes are obtained by anisotropic etching from the
back side of the wafer, using an additional mask and a double-side aligned
photolithographic step. With a further step, it is possible to etch devices from the
front side, so that spring-mass structures can be obtained.

Pressure sensors 

Piezoresistive pressure sensors have been fabricated using this post-processing
technology. The piezoresistors are obtained from the standard PMOS implant used
for transistor source and drain for thicker membranes (2.5 to 10 μm), and from
doped polysilicon for thinner ones (up to 2.5 μm), e.g. Marco (1993). Pressure
sensors with a 600 × 600 μm² membrane of 5 μm thickness are currently working over a
0.3 bar range, with 20 mV/V/bar sensitivity and non-linearity lower than 0.4 %.
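
As a rough worked example of what these figures imply, the ideal full-scale bridge output is the product of sensitivity, supply voltage and pressure. The sketch below reads the sensitivity as 20 mV/V/bar and the non-linearity as a fraction of full scale; both readings, the 5 V bridge supply and the function name are assumptions for illustration, not values stated in the paper.

```python
def bridge_output_mv(pressure_bar, supply_v=5.0, sensitivity_mv_per_v_bar=20.0):
    """Ideal (linear) bridge output in millivolts.

    supply_v is a hypothetical bridge supply; the sensitivity default is the
    assumed reading of the figure quoted in the text.
    """
    return sensitivity_mv_per_v_bar * supply_v * pressure_bar


full_scale_mv = bridge_output_mv(0.3)          # output at the 0.3 bar range limit
max_nonlinearity_mv = 0.004 * full_scale_mv    # 0.4 % of full scale (assumed meaning)
print(full_scale_mv, max_nonlinearity_mv)
```

With these assumed values the full-scale signal is a few tens of millivolts, which illustrates why amplification close to the sensor matters for such devices.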

Accelerometers 

Pressure sensors are based on a single membrane, while accelerometers use a large
mass of silicon, connected to the main silicon part that contains the circuitry by
small bridges on which the piezoresistors are placed; the bridges have the same
thickness as the membrane of the pressure sensors. These structures have been
successfully fabricated, and preliminary measurements of such devices have shown
satisfactory operation, though the design should be improved in order to obtain a
larger range of operation and higher sensitivity. Competing with capacitive-type
accelerometers seems difficult, although processing and integration with simple
CMOS technologies is easier.

ISFET Device 

The structure uses floating polysilicon with two metal layers on top (including
stacked via and contact). Therefore, ISFET fabrication is fully compatible with the
CMOS process (with some allowed design rule violations) and, consequently, its
integration with the technology is straightforward, e.g. Merlos (1995).

Implemented devices show mean sensitivities of 41 mV/pH. Packaging is done
on special PCB strips. An isolating material is used to protect both the
active devices and the bonding area placed around the sensor device (1 mm²).

Figure 1 shows different aspects of these devices: the membrane cavity of
pressure sensors, the floating mass of an accelerometer, and the profile of the
different layers for ISFET devices. The standard size of the devices is shown on each
SEM photograph.

Figure 1. Devices obtained from the standard 1 μm CMOS process plus additional
processes for pressure sensors (left), accelerometers (center) and ISFET (right).

3 DESIGN KIT & METHODOLOGY 

ASIC design methodology requires that customers be responsible for, at least,
giving a structural description of their products. In this case, they only require
front-end tools to define their designs. Back-end processes are done either by
customers or by the foundry. Finally, layout generation is usually done by the
foundry. Nevertheless, full-custom design is sometimes still required. This
often happens in sensor design, because of the specificity of the sensor
devices and their range of operation.

The design methodology provides the path from specifications to layout. The
border between design and technology is the sensor cell library for the “front-end
only” design case, and the technology file for the complete design flow. A similar
approach can be found in Karam (1996); the main differences concern the set of sensors
(only front-side etching is allowed) and the framework used (Mentor Graphics).

3.1 Full-custom design environment: technology file 

Full-custom design requires a set of design rules for the different masks plus the
physical parameters for the given simulator or device model. The basic tools of a
framework can easily cope with these enhancements, by adapting the technology file
for all geometrical definitions and relations, and by using a simulation environment
that allows the integration of, or a link with, additional simulators.

Sensor layouts are drawn using different geometries depending on the type of
sensor. For instance, due to the anisotropic properties of silicon, drawings on the
membrane layer are treated differently for pressure sensors and for accelerometers,
which include additional relations between the membrane and passivation masks.

Therefore, the new technology file covers: (1) the definition of design rules for the
membrane mask, (2) the redefinition of rules for the remaining layers inside the sensor
region, and (3) any specific restriction concerning the type of sensor.

Classical methods for the definition of design rules use creation and alignment
tolerances that depend on the machinery and the degree of confidence. For
microsystems, most of the values related to the new masks are quite large compared
to the critical dimensions of standard layers. The geometry imposed by design
rules related to back-side etching is also quite large. This is because most of the
instrumentation is not optimized for these new processes: double-side alignment,
electrochemical etch stop, whole-wafer orientation according to the crystallographic
axes of the silicon, and under-etching of the back-side mask material.

All these aspects are taken into account, for each sensor type, in our current
technology file (the standard term in Cadence DFWII) through data concerning layer
information, device definition, DRC rules, extraction and LVS rules, and abstract
generation.

For the device extractor and the layout versus schematic (LVS) tool, the
sensor cell has to be defined in advance, in order to extract the geometric
parameters according to the device model. For instance, for the pressure sensor, we
consider as library elements both a single piezoresistive element and a symmetric
Wheatstone bridge of piezoresistors (Figure 2).

For the floor-plan of microsystems composed of sensor devices and circuitry,
the role of the sensors is also important, since they cannot be placed anywhere. There
are different types of physical restrictions: the distance between the sensor and the
rest of the circuit, the pads, or the boundary. The first two are related to changes in
electrical behavior and bonding properties, while the last one is related to the
mechanical properties of sensor packaging.




Figure 2. Symbols for piezoresistive device (left) and Wheatstone bridge (right). 

Furthermore, there are electrical constraints, similar to the classical ones for
mixed A/D ICs (couplings, parasitic resistances, ...), that call for minimizing the
distance and crossing of sensitive lines and for making the device as symmetric as
possible. A layout example for a pressure sensor is shown in Figure 3 (a).

3.2 Semi-custom environment 

ASIC design methodologies are based on handling elements described at a higher
degree of abstraction, and the related CAD tools have evolved to integrate different
description levels (behavioral, structural, physical) into the same environment. In
our case, microsystems contain a reduced number of devices, included in a basic
sensor cell library of parameterizable devices, with the appropriate CAD tools.

Our semi-custom environment also includes analog cells designed to meet the
specific electrical constraints of sensors (low signal level, impedance coupling, ...).

Device model generation 

For the design and improvement of any sensor, device engineers usually work with
3D structural simulators (e.g. Ansys) or derived environments, e.g. Korvin (1996).
For instance, they derive some mechanical properties (stress) of mobile structures
in silicon and, from that stress, taking into account the piezoelectric properties of
the materials, obtain the equivalent electrical signals for each device. At
this level, many different types of analyses can be performed (static, dynamic,
harmonic, thermal, ...).

Different authors propose mixed-mode simulation techniques to integrate this
kind of simulator with electrical or HDL ones. The drawback of this global
simulation is that computation through finite element modeling requires a large
amount of memory and CPU time, apart from the economic cost of those
simulators.

Therefore, in order to use standard frameworks for system-level design, a higher
degree of abstraction should be used after the initial modeling of sensor cells for
the required analysis. These models can be derived either from an analytic
theory, which can easily be implemented using mathematical solvers (e.g.
Mathematica), or from behavioral models. Both approaches require basic process
parameters and device parameters. In the end, the computational cost and the price
for the customer are reduced.

The critical point is the agreement between simulation and measurement.
However, once the process parameters are characterized, the differences are not
significant, and most sources of non-ideal behavior can easily be introduced in
the model (e.g. misalignment between piezoresistors and the border of the membrane).

This approach has other advantages related to CAD tools: (1) portability,
because the C language is used for the sensor models; (2) a wide range of application,
since representations at different abstraction levels are obtained starting from
high-level specifications (sensitivity, range, ...); (3) the basic equations for sensors are
relatively simple (polynomials), and therefore convergence problems are reduced;
(4) different models can be used for the same sensor according to the different
working conditions (defined for each application).

Figure 3. Pressure sensor: (a) Layout with four piezoresistors inside n-well and
membrane, and (b) voltage of two outputs (top and center) and bridge (bottom).

For instance, pressure sensors can be modeled at four different levels for static
analysis, Hebrard (1995):

• Level 1. Linear plate theory, with first-order piezoresistive coefficients and
simple temperature dependence.

• Level 2. Precise model for temperature, based on electrothermal coupling.

• Level 3. Large deflections are considered. The sensor output is represented by a
look-up table Vout = f(P,T).

• Level 4. Similar to the previous one, taking into account second-order
piezoresistive coefficients. An example is shown in Figure 3 (b).
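
A Level 3 model of the form Vout = f(P,T) can be evaluated by interpolating in a table. The sketch below is only an illustration: the grid points, the table values and the names `vout_lookup` and `_cell` are invented placeholders, not data from the paper, and the paper does not state which interpolation scheme its models use. Bilinear interpolation is assumed here because it is the simplest scheme that is continuous across table cells.

```python
from bisect import bisect_right

# Placeholder grids and table (hypothetical values, for illustration only).
P_GRID = [0.0, 0.1, 0.2, 0.3]      # pressure, bar
T_GRID = [0.0, 25.0, 50.0]         # temperature, degrees Celsius
V_TABLE = [                        # output in mV; rows follow P_GRID, columns T_GRID
    [0.0, 0.0, 0.0],
    [10.0, 9.8, 9.6],
    [20.1, 19.7, 19.3],
    [30.3, 29.7, 29.1],
]


def _cell(grid, x):
    """Return index i and weight w such that x = grid[i] + w * (grid[i+1] - grid[i])."""
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("value outside the table range")
    i = min(bisect_right(grid, x) - 1, len(grid) - 2)
    w = (x - grid[i]) / (grid[i + 1] - grid[i])
    return i, w


def vout_lookup(p, t):
    """Bilinear interpolation of the sensor output at pressure p and temperature t."""
    i, u = _cell(P_GRID, p)
    j, v = _cell(T_GRID, t)
    return ((1 - u) * (1 - v) * V_TABLE[i][j]
            + u * (1 - v) * V_TABLE[i + 1][j]
            + (1 - u) * v * V_TABLE[i][j + 1]
            + u * v * V_TABLE[i + 1][j + 1])


print(vout_lookup(0.05, 12.5))
```

A table of this kind keeps the model free of the non-linear equations behind large-deflection theory, at the price of refusing inputs outside the characterized range.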

The choice of the model depends on the pressure range and on the precision required
by the application (which determines whether second-order terms are included). These
models use different parameters related to: (1) technology dependencies, used for
process and device optimization and for fitting simulation results to measurements;
(2) sensor type, according to specific predefined structures and active
materials; and (3) design parameters, concerning the geometric values of the devices,
which can be extracted from layout views. Figure 4 shows the form that appears in
the Cadence framework for model compilation.

Different models can be generated in a SPICE-like format and integrated
in Design Framework II. Following a similar scheme, Verilog models can be
produced (currently a standard for ATMEL-ES2 cell libraries). The difference
between the two types of models lies in the mathematical or tabular functions
available to describe the model, which is important for sensor design. Nevertheless,
it is even more important to work at the level of precision required for the
system design. This level is fixed by the application, and in most cases
will require mixed-mode simulation composed of electrical models for the sensor and
analog parts, and HDL models for the digital ones.

Figure 4. Fill form with pressure sensor characteristics inside Cadence DFWII.

Sensor cell library

At this point, it is important to recall our main objective. We did not develop a
completely new process; instead, we started from a standard process (1 μm from
ATMEL-ES2) and introduced some additional steps. In this way, we enable the
integration of different types of sensors together with all previously designed and
characterized cells, which include analog and digital cells, pads, compiled structures,
soft macros, etc. As a consequence, any new addition to the former design kit has to
preserve its capabilities.

The concept of a specific sensor cell library under Cadence Design Framework II
has been developed in two directions: the generation of a set of predesigned and
precharacterized sensors that cover specific ranges of applications, and an
automatic generation tool for sensors. All the required representations can be found
for every cell (schematic, symbol, layout, extracted, abstract, spice, verilog).

Simulation environment 

An important point in defining a design methodology for ASIS (Application
Specific Integrated Sensors) is the simulation environment and its connection
with test and calibration. These subjects are still open questions for microsystems.

At circuit simulation level, the number of signal sources is limited (one for each
physical magnitude to be measured) and the dynamic range of the devices is quite low
compared to MOS transistors (tens of kHz). The usual analyses performed are
DC, AC, transient and noise. Following our approach, some simulations are
performed purely in the analog domain, while system-level simulations, which contain
digital cells, are performed with mixed-signal tools.

The choice depends not only on the availability of tools but also on the design kit
offered by ATMEL-ES2. In this kit, both analog and digital cells are represented
by digital models, and the basic simulation tool is based on Verilog. We therefore use
mixed-mode simulation for sensors and ad hoc designed analog cells. In the case
of specific design constraints, SPICE models have to be requested from the foundry.

In terms of high-level simulation, we provide basic Verilog models for sensors, but
we mainly try to exploit the mixed-signal environment, Analog Artist in DFWII, by
using electrical models for both sensors and analog cells (cdsspice, spectre or
hspice) and structural models for digital ones (Verilog); some modifications to the
current design kit were therefore required. Figure 5 shows an example of such a simulation.







4 CIRCUIT EXAMPLES 

For the ISFET case, a standard ISFET-meter structure has been built with
ATMEL-ES2 analog cells. For this preliminary structure, the sensor device and the
circuitry are placed next to each other, as shown in Figure 6.




Figure 6. Photograph of the pH-meter composed of sensors (right) and circuit (left) 

Measurements of pH show acceptable results for this device, both for the range of
operation (Figure 7) and for the sensitivity, 41 mV/pH. The circuit works according
to specifications.
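
With a roughly linear response, a measured output shift converts to a pH change through the 41 mV/pH sensitivity. In the sketch below the calibration point (`v_ref_mv` at `ph_ref`), the sign convention (output rising with pH) and the function name are assumptions; only the 41 mV/pH figure comes from the measurements reported here.

```python
SENSITIVITY_MV_PER_PH = 41.0  # mean sensitivity reported for the fabricated devices


def ph_from_voltage(v_mv, v_ref_mv, ph_ref=7.0):
    """Estimate pH from the output voltage in millivolts.

    Assumes a linear response calibrated at the hypothetical point
    (v_ref_mv, ph_ref) and an output that rises with pH.
    """
    return ph_ref + (v_mv - v_ref_mv) / SENSITIVITY_MV_PER_PH


# An 82 mV shift above the calibration point corresponds to two pH units.
print(ph_from_voltage(v_mv=582.0, v_ref_mv=500.0))
```

A single-point calibration of this kind only holds within the linear range shown in Figure 7; outside it, a table or a higher-order fit would be needed.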

Figure 7. Measurement of range and sensitivity for fabricated ISFET devices (plot title: Atmel-ES2 oxynitride ISFET; horizontal axis: pH).

For the pressure sensor case, two microsystems have been designed following two
different approaches. For the first one, 2 × 3 mm² (Figure 8), analog and digital cells
from ATMEL-ES2 have been used, and for the second one, 2.7 × 2.6 mm², specially
developed analog cells using current-mode techniques were used (designed by CNM
Sevilla). Both microsystems perform pressure sensing, analog preprocessing,
analog-to-digital conversion and digital coding in order to interface with a standard
8051-like microcontroller. A special plastic package has been
developed by the customer, COPRECI, for this project.

Figure 8. Photograph of the integrated microsystem based on a pressure sensor.

The application environment is a prototype washing machine, composed of a water
tank, the pressure sensor that measures the amount of water present in the
tank, and the microcontroller. Preliminary measurements made by the industrial
partner show a resolution of 6 bits. Further advantages are noise reduction, due to
signal amplification closer to the sensor, and better control, provided by digital
coding for data transmission. As a consequence, industrialization is being considered.

5 CONCLUSIONS 

Successful industrialization of integrated microsystems requires demonstrating the
advantages of monolithic solutions: low extra cost compared with standard ASICs,
reduced area, better signal-to-noise ratio, and fewer companies involved in
the fabrication of the components of the whole system.

For a wide diffusion of this alternative, standardization of foundry processes and
the definition of a clear design methodology are required. The objective of our work
has been the development of specific cell libraries and design tools for the design of
such microsystems. The manufacturing process is based on a standard CMOS technology,
1.0 μm from ATMEL-ES2, and additional post-processing steps performed by CNM.

New concepts are required for system design. Our approach reuses, as far
as possible, the concepts used in mixed-signal circuit design. Thus, a standard
commercial CAD tool, the Cadence DFWII framework, is used.

A bottom-up strategy to develop process-dependent data is proposed. It starts
with the definition of the collection of design rules and the basic set of cells at
full-custom level, continues with the development and integration of a specific tool
for device model compilation, and ends with the definition of a design methodology.

By using this methodology, several devices and microsystems have been
designed and tested. Two of them, for pH and pressure measurements, are
presented together with some experimental results. Both behave correctly
and allowed us to confirm the advantages of this new approach.